A novel clustering-based purity and distance imputation for handling medical data with missing values

Nowadays, people pay increasing attention to health, and the integrity of medical records has come into focus. Medical data imputation has recently become a very active field because medical data usually contain missing values. Many imputation methods have been proposed, but model-based methods such as expectation–maximization and regression-based imputation assume that the variables follow a multivariate normal distribution, an assumption that can lead to biased results. Such methods can also become a bottleneck, since they are computationally more complex than model-free methods. Furthermore, directly removing instances with missing values has several drawbacks: it can discard important data, produce ineffective research samples, and introduce research bias. Therefore, this study proposes a novel clustering-based purity and distance imputation method to improve the handling of missing values. In the experiment, we collected eight medical datasets to compare the proposed imputation methods with the listed imputation methods under different conditions. For evaluation, the area under the curve (AUC) is used to measure performance on the imbalanced-class datasets in the MAR and MCAR experiments, and accuracy is applied to the balanced-class data in the MNAR experiment. The root-mean-square error (RMSE) is also used to compare the proposed and listed imputation methods. In addition, this study utilized the elbow method and the average silhouette method to find the optimal number of clusters for all datasets. Results showed that the proposed imputation method improves accuracy, AUC, and RMSE across different missing degrees and missing types.


Introduction
Nowadays, everyone is concerned about the state of their health and pays increasing attention to healthcare issues, and physicians confirm disease diagnoses from the relevant medical records of the patient. However, medical records usually have missing values due to, for example, patient privacy, evasion of medical examination, and avoidance of medical treatment. Missing data limit the interpretation of research results because the problem is typically handled by removing the observations with missing values. This may exclude important information of theoretical interest and increase misclassification costs, such as type I and type II errors. In addition, removing instances with missing values reduces the number of samples, so the estimates have larger standard errors and the analysis may be biased. Hence, in medical research, data imputation has become an important research issue. Furthermore, medical data usually have imbalanced classes, where the class distributions are not represented equally, which produces a misclassification-cost problem.
To handle missing values, many researchers have proposed various techniques, but many model-based imputation methods, such as expectation–maximization and multiple imputation, assume that the variables follow a multivariate normal distribution (Schafer 1997). Stronger assumptions can lead to biased results and make the model difficult to learn; they can also become a bottleneck, since such methods are computationally more complex than model-free methods. In addition, most of them have not been evaluated across all missing degrees and missing types. The most common practice is to remove the records with missing values, which can discard important data, produce ineffective samples, and introduce research bias. In previous work on handling missing values, several common methods replace the missing values (Amiri and Jensen 2016), such as the average of the other complete values or the average of the values in the same class (class average imputation). After the missing values are replaced, classifiers usually achieve better accuracy than on the original datasets (Donders et al. 2006), and these replacement methods are usually called imputation methods (Ondeck et al. 2018a). In contrast, the common approach of deleting records with missing values risks removing something important and causing research bias.
Given the problems mentioned above, this study proposes a novel clustering-based purity and distance imputation method to improve the handling of missing values; it is a hybrid method, and the attribute data do not need to follow a multivariate normal distribution. The proposed approach combines k-means with purity-based k nearest neighbors imputation (PkNNI) and combines clustering with distance-threshold nearest neighbors imputation (DNNI). The main difference between the two original approaches (PkNNI and DNNI) and the proposed method is that the latter finds the optimal number of clusters and adapts the two kNN-related approaches to obtain the optimal results for the eight medical datasets. Medical data have higher missing degrees (MDs); hence, research works typically require a higher percentage of complete data (e.g., normally more than 50% of the total cases) and impute fewer than three values per patient (Jerez et al. 2010a). Therefore, this study simulated a massive missing degree (360% under the traditional definition, but 20% under the MD definition of this paper) to handle missing values. For evaluating imputation methods, most previous studies used accuracy and RMSE, but medical data usually have imbalanced classes. Therefore, this paper applies accuracy, the area under the curve (AUC), and the root-mean-square error (RMSE) to handle the problem of imbalanced classes.
In summary, the objectives and contributions of this study are as follows: (1) propose two novel algorithms that combine clustering with purity-based k nearest neighbors imputation (PkNNI) and with distance-threshold nearest neighbors imputation (DNNI) to estimate the missing values; (2) apply the elbow and average silhouette methods to find the optimal number of clusters consistently; (3) compare the proposed imputation methods with the listed imputation methods across different missing degrees and missing types; (4) evaluate the performance of the proposed imputation method using root-mean-square error (RMSE), classification accuracy, and AUC; and (5) apply the proposed imputation method to medical data with missing values.
The rest of this paper is organized as follows: Sect. 2 describes the related work, including medical data imputation, missing value types, clustering techniques, and imputation methods. Section 3 introduces the concept of the proposed method and its procedure. Section 4 outlines the experiments and results. The discussion and findings are provided in Sect. 5. Finally, the conclusions are given in Sect. 6.

Related work
This section introduces the related literature, including medical data imputation, missing value types, clustering technique, and imputation method.

Medical data imputation
Medical data imputation is an active area of research because medical data are relevant to patient health and physician decision making and are therefore more important than any other type of data. Physicians need to refer to patients' medical records for diagnosis, and if the records contain errors, they may lead to diagnostic errors. Achieving complete medical records is very hard because data go missing for various reasons, such as patient privacy, human negligence, and medical equipment dysfunction.
Missing values are most common when clinical trials are carried out because medical data come from various clinical trials, field experiments, and other traditional mechanisms. Proper care must be taken to handle missing values suitably and accurately. Hence, missing value imputation in medical data faces many challenges, listed as follows. (1) A simple way to handle missing values is to discard every record that contains a missing value for an attribute, but this does not solve the problem: measuring the missing value incurs additional cost, whereas previously reported statistical methods show reduced performance compared to when all variables are measured. (2) Missing degrees in electronic health records have been reported to range from 20 to 80% (Kharrazi et al. 2014). Hence, the majority of research works normally required a higher percentage of complete data (e.g., normally more than 50% of the total cases) and imputed fewer than three values per patient (Jerez et al. 2010a). (3) The selection of the imputation method is a challenge: should a model-based or a model-free method be used? (4) Evaluating imputation performance is also a challenge; most previous studies used accuracy and RMSE, but medical data usually have imbalanced classes.
In summary, missing medical values can be handled by imputation, deletion, or classification of the missing type. To avoid bias, classifying the pattern of missing values is an efficient step toward selecting appropriate imputation methods and improving data consistency. Many imputation methods have been used in the medical field (Ondeck et al. 2018a; Mühlenbruch et al. 2017; Enders 2017; Galan et al. 2017), such as multiple imputation, expectation–maximization imputation, k nearest neighbor imputation, and multiple imputation by chained equations (MICE). These studies are briefly introduced as follows. Sterne et al. (2009) used MI in epidemiological and clinical research. Jerez et al. (2006) combined artificial neural networks with data imputation for breast cancer prognosis. García-Laencina et al. (2015) used k nearest neighbor, mode, and expectation–maximization imputation for 5-year survival prediction of breast cancer patients with unknown discrete values. Pombo et al. (2015) combined data imputation and statistics to design a clinical decision support system; the next year, they proposed a patient-oriented pain evaluation system (Pombo et al. 2016) that produced tailored alarms, reports, and clinical guidance on the basis of collected patient-reported data, which served as a clinical decision support system.

Missing values type
In the real world, collected datasets are usually incomplete for reasons such as human negligence, equipment failure, network disconnection, data not originating from the same source, post-processing errors, and noise in the collection phase; we call these cases non-structured missingness. Rubin (1976) reported that there are three types of missing values: Missing At Random (MAR), Missing Completely At Random (MCAR), and Missing Not At Random (MNAR). This study focuses on the non-structured missingness problem, which has these three types.
In the MAR type, the probability that a data point is missing does not depend on the missing data but does depend on some of the observed data, i.e., the missingness is conditional on another variable. In the MCAR type, missingness depends on neither the observed data points nor the missing data; that is, a data point is missing completely at random. The MNAR type occurs when missing values in a variable are related to the values of the variable itself, even after controlling for other variables. MNAR can have two origins: missingness that depends on attributes of instances with other missing data, or a missing element that depends on its own value (Rubin 1976).
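As a concrete illustration, the MCAR and MAR mechanisms above can be simulated by masking cells of a complete matrix. This is an illustrative sketch, not part of the cited methods: the masking probabilities, the random seed, and the choice of a single observed "driver" column for MAR are our own assumptions (MNAR, whose missingness depends on the unobserved value itself, is omitted here).

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def make_mcar(X, rate):
    """MCAR: every cell has the same probability of being masked,
    independent of both observed and missing values."""
    X = X.astype(float).copy()
    mask = rng.random(X.shape) < rate
    X[mask] = np.nan
    return X

def make_mar(X, rate, driver_col=0):
    """MAR (sketch): missingness in the other columns depends only on an
    observed 'driver' column; rows where the driver is above its median
    are three times as likely to lose values (average is about `rate`)."""
    X = X.astype(float).copy()
    high = X[:, driver_col] > np.median(X[:, driver_col])
    p = np.where(high[:, None], 1.5 * rate, 0.5 * rate)
    mask = rng.random(X.shape) < p
    mask[:, driver_col] = False   # the driver column stays fully observed
    X[mask] = np.nan
    return X

data = rng.normal(size=(100, 5))
mcar = make_mcar(data, 0.10)
mar = make_mar(data, 0.10)
print(f"MCAR missing fraction: {np.isnan(mcar).mean():.2f}")
print(f"MAR missing fraction:  {np.isnan(mar).mean():.2f}")
```

In the MAR sketch the driver column is never masked, which mirrors the definition: the missingness of one variable is conditional on another, fully observed variable.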
In practice, there are two ways of handling missing values: one is marginalization (Wagstaff 2004), and the other is imputation. Marginalization deletes or ignores missing values and is the most common approach to handling them. However, marginalization may cause research bias by deleting the data records/instances with missing values, whereas data imputation maintains the original dataset and avoids this problem.

Clustering
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is the main task of exploratory data mining and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
Clustering is an unsupervised machine learning technique that groups data points: given a set of data points, a clustering algorithm assigns each point to a specific group. In theory, data points in the same group should have similar properties (features), while data points in different groups should have highly dissimilar properties. In data science, clustering analysis provides valuable insights from the collected data and visualizes which group each data point falls into.
There are many clustering methods, such as connectivity-based clustering (hierarchical clustering), centroid-based clustering, distribution-based clustering, and density-based clustering. This study uses k-means clustering, a centroid-based method in which clusters are represented by a central vector, and the center of a cluster may not necessarily be a member of the dataset. When the number of clusters is fixed to k, we find k cluster centers and assign the objects to the nearest cluster center so that the squared error sum within the clusters is minimized:

$$\min_{C_1,\ldots,C_K} \sum_{k=1}^{K} W(C_k), \qquad W(C_k) = \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,$$

where C_k denotes the kth cluster, μ_k is its center, and W(C_k) is the error sum of squares (SSE) within the cluster.

Average silhouette method
In addition to calculating the SSE, another method of measuring the clustering effect is the average silhouette method (Rousseeuw 1987). The silhouette coefficient measures the effect of clustering on the basis of the cohesion and dispersion of each data point i; its equation is as follows:

$$s(i) = \frac{\lvert b(i) - a(i) \rvert}{\max\{a(i),\, b(i)\}},$$

where a(i) denotes the average distance between data point i and the other data points in its cluster; b(i) is the average distance between data point i and the data points of the other clusters, taking the minimum over those clusters; |b(i) − a(i)| denotes the absolute value of b(i) − a(i); and s(i) is the silhouette coefficient, with 0 ≤ s(i) ≤ 1. The coefficient indicates whether data point i sits appropriately within the cluster to which it belongs: s(i) close to 1 means that the data are properly clustered, and s(i) = 0 denotes that the number of clusters is 1.
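The silhouette coefficient above translates into a short function. Note that this sketch mirrors the paper's absolute-value formulation with range [0, 1]; the classic Rousseeuw (1987) coefficient drops the absolute value and ranges over [−1, 1]. The toy data are our own.

```python
import numpy as np

def silhouette(i, X, labels):
    """Silhouette coefficient s(i) = |b(i) - a(i)| / max(a(i), b(i)).
    a(i): mean distance from point i to the other points in its cluster.
    b(i): smallest mean distance from i to the points of any other cluster."""
    own = labels[i]
    d = np.linalg.norm(X - X[i], axis=1)   # distances from point i to all points
    same = (labels == own)
    same[i] = False                        # exclude the point itself from a(i)
    a = d[same].mean()
    b = min(d[labels == c].mean() for c in set(labels) if c != own)
    return abs(b - a) / max(a, b)

# Two clearly separated 1-D clusters: every point should score near 1.
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
labels = np.array([0, 0, 0, 1, 1, 1])
scores = [silhouette(i, X, labels) for i in range(len(X))]
print(f"mean silhouette: {np.mean(scores):.3f}")
```

For well-separated clusters b(i) is much larger than a(i), so the mean silhouette approaches 1, which is the signal the average silhouette method looks for when choosing k.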

Imputation method
This section introduces the related imputation methods, including simple, multiple, kNN-family, and computational intelligence imputation.

Simple imputation
Simple imputation methods are common (Ondeck et al. 2018a; Wagstaff 2004); they include zero imputation, average imputation, and class average imputation. Zero imputation (ZI) is the simplest imputation function and fills the missing values with zero. Average imputation (AI) (Wagstaff 2004) replaces a missing value with the average of the corresponding attribute over the entire dataset. Class average imputation (CAI), or concept mean imputation, replaces the missing value with the average of the attribute over all instances with the same class label.
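The three simple schemes can be written directly for a single attribute; the arrays below are toy data for illustration.

```python
import numpy as np

def zero_impute(x):
    """ZI: replace every missing value with 0."""
    return np.where(np.isnan(x), 0.0, x)

def average_impute(x):
    """AI: replace missing values with the attribute mean over the whole dataset."""
    return np.where(np.isnan(x), np.nanmean(x), x)

def class_average_impute(x, y):
    """CAI: replace missing values with the attribute mean of the same class."""
    out = x.copy()
    for c in np.unique(y):
        rows = (y == c)
        out[rows & np.isnan(x)] = np.nanmean(x[rows])
    return out

x = np.array([1.0, np.nan, 3.0, np.nan, 5.0, 7.0])
y = np.array([0,   0,      0,   1,      1,   1])
print(zero_impute(x))              # missing -> 0
print(average_impute(x))           # missing -> 4.0 (mean of 1, 3, 5, 7)
print(class_average_impute(x, y))  # class 0 -> 2.0, class 1 -> 6.0
```

The example shows why CAI can outperform AI: the class-0 gap is filled with 2.0 and the class-1 gap with 6.0, instead of the global 4.0 for both.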

Multiple imputation
Multiple imputation methods have been applied in medical studies (Zhang 2016;Ondeck et al. 2018b), including multivariate imputation by chained equation (MICE) and expectation-maximization imputation. Zhang (2016) introduced a multivariate imputation by chained equation (MICE) by using the R package in medical research. Ondeck et al. (2018b) used multiple imputation in arthroplasty research and presented the results of comparisons between the demographic characteristics of patients with and without missing preoperative albumin and hematocrit values.

kNN family imputation
Recently, k nearest neighbor imputation (kNNI) has become widely applied in medical imputation (Batista and Monard 2003). In kNNI, a dataset is divided into two sub-datasets: one contains the incomplete data with missing values, and the other contains the complete data without any missing values. The missing values of the incomplete data are replaced by the average of the corresponding attribute of the k nearest neighbors, computed over the complete data. However, this method tends to let noise and outliers become part of the predicted value, which may affect the accuracy of the imputation. In the kNNI family, Troyanskaya et al. (2001) proposed a weighted kNN imputation (WkNNI); Keerin et al. (2016) proposed a cluster-directed framework with neighbor-based imputation; and Lee and Styczynski (2018) proposed a new no-skip kNN to impute MNAR values. Additionally, Cheng et al. (2019) proposed purity k nearest neighbors imputation (PkNNI), an extension of kNNI based on purity training and purity imputation. Furthermore, Cheng et al. (2020) proposed distance nearest neighbors imputation (DNNI), which also extends kNNI.
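The basic kNNI scheme can be sketched in a few lines of numpy. Computing distances over the observed attributes only is a common convention rather than something Batista and Monard (2003) prescribe, and the toy arrays are our own.

```python
import numpy as np

def knn_impute(incomplete, complete, k=3):
    """kNNI sketch: for each row with missing values, find its k nearest
    complete rows (Euclidean distance over the observed attributes only)
    and fill each gap with the average of those neighbours' values."""
    out = incomplete.astype(float).copy()
    for row in out:
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        d = np.sqrt(((complete[:, obs] - row[obs]) ** 2).sum(axis=1))
        nn = np.argsort(d)[:k]                      # k nearest complete rows
        row[miss] = complete[nn][:, miss].mean(axis=0)
    return out

complete = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [9.0, 90.0]])
incomplete = np.array([[2.0, np.nan]])
print(knn_impute(incomplete, complete, k=3))  # second attribute -> 20.0
```

With k=3 the outlier row [9, 90] is excluded from the neighbour set, which illustrates both the strength of kNNI and its sensitivity to noisy neighbours when k is too large.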

Computational intelligence imputation
There are many computational intelligence imputations (Lin and Tsai 2020), such as neural networks, random forests, and fuzzy c-means (FCM). Awan et al. (2021) modeled class-specific distributions by adapting the popular conditional generative adversarial network (CGAN) to impute the missing data. We briefly introduce FCM imputations because this study is based on clustering imputation techniques. Hathaway and Bezdek (2001) proposed four FCM imputation techniques and pointed out that the whole data strategy and partial distance strategy terminate faster, but the optimal completion strategy (OCS) and nearest prototype strategy (NPS) outperformed the first two methods in terms of accuracy and misclassification errors. In addition, Al Shami et al. (2013) applied FCM-based OCS and NPS, comparing them with four statistical imputation methods, to accurately substitute missing scores when producing intelligent synthetic composite indicators. The FCM-based OCS and NPS methods are briefly introduced as follows.
(1) In the OCS approach, the missing values are viewed as additional attributes to be optimized, and the missing values are imputed at each iteration until the best estimates are reached. (2) NPS is an OCS modification that computes partial distances; missing values are estimated by their nearest prototype counterparts during each iteration. Among hybrid clustering-based imputation methods, Dinh et al. (2021) proposed a framework for clustering mixed numerical and categorical data with missing values: it used a decision-tree-based method to find the set of correlated data instances, used mean and kernel-based methods to obtain cluster centers for numerical and categorical attributes, and applied a dissimilarity measure to calculate the distances between instances and cluster centers.

Proposed imputation method
Many previous studies directly deleted data with missing values, which may remove key dataset information and cause research bias. Recently, many studies have applied statistical methods, such as hot-deck imputation (Andridge and Little 2010) and cold-deck imputation (Shao 2000), to estimate the missing values. Many studies have also directly removed outliers from a collected dataset, but outliers usually have practical meanings in the real world, such as traffic at popular holiday sightseeing sites. Jerez et al. (2010b) reported that artificial intelligence imputation is better than traditional statistical imputation; therefore, this study proposes an artificial intelligence imputation method to estimate missing values. The proposed method uses k-means on the conditional attributes to cluster the dataset and then applies PkNNI and DNNI imputation to estimate the missing values. This study uses PkNNI and DNNI because PkNNI can exploit the purity of the same class, and DNNI can calculate the weighted distance to find the nearest-neighbor group. Furthermore, this study employs the elbow and average silhouette methods to determine the best number of clusters.

PkNNI and DNNI imputations
The PkNNI and DNNI imputations are the main techniques in the proposed method. Hence, the following introduces the computational steps of PkNNI and DNNI imputations and equations to explain the mathematics and meaning.
(A) PkNNI imputation (Cheng et al. 2019): PkNNI can be divided into two parts: purity training and purity imputation.
(1) Purity training computes the purity of each complete instance. The purity of the ith instance, P_t(i) (Eq. 9), is defined as

$$P_t(i) = \sum_{s=1}^{k_1} V\big(C(i), N_s\big),$$

where C(i) is the ith complete instance from dataset X, k_1 is the number of nearest neighbors used for the purity calculation, N_s denotes the sth nearest neighbor instance, and the function V() compares the class labels of instance C(i) and N_s to identify whether they are the same. This comparison is carried out by the function vote() (Eq. 10):

$$vote\big(L(i), L(j)\big) = \begin{cases} 1, & L(i) = L(j), \\ 0, & \text{otherwise}, \end{cases}$$

where L(i) denotes the class label of the ith instance in dataset X; whether instance C(i) is pure is deduced by comparing the class labels L(i) and L(j).
(2) Purity imputation predicts the missing values on the basis of the complete instances and their purity values; the imputation M(i, j) (Eq. 11) is defined as

$$M(i, j) = \frac{1}{k_2} \sum_{s=1}^{k_2} R(s, j),$$

where M(i, j) represents the jth attribute of the ith instance, which is the missing value; k_2 is the number of nearest neighbors; and R(s, j) represents the jth attribute of the sth nearest instance, drawn from the collection of all positive-purity complete instances.
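To make the two PkNNI stages concrete, here is a minimal numpy sketch of purity training followed by purity imputation. It is a reconstruction from the description above, not the exact published algorithm: the purity cutoff (> 0.5), Euclidean distance, and distance over observed attributes only are our assumptions.

```python
import numpy as np

def purity(X, y, k1=3):
    """Purity training sketch: for each complete instance, the fraction of
    its k1 nearest neighbours that share its class label (vote() = 1 when
    the labels match, 0 otherwise)."""
    n = len(X)
    p = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the instance itself
        nn = np.argsort(d)[:k1]
        p[i] = np.mean(y[nn] == y[i])
    return p

def pknn_impute(row, X, y, p, k2=3):
    """Purity imputation sketch: keep only 'pure' complete instances
    (assumed cutoff: purity > 0.5) as candidate neighbours, then average
    the k2 nearest ones, measured over the observed attributes."""
    pure = p > 0.5
    miss = np.isnan(row)
    d = np.linalg.norm(X[pure][:, ~miss] - row[~miss], axis=1)
    nn = np.argsort(d)[:k2]
    filled = row.copy()
    filled[miss] = X[pure][nn][:, miss].mean(axis=0)
    return filled

# Toy data: three class-0 instances near (1, 10), two class-1 near (5, 50).
X = np.array([[1.0, 10.0], [1.2, 11.0], [0.9, 9.0], [5.0, 50.0], [5.1, 52.0]])
y = np.array([0, 0, 0, 1, 1])
p = purity(X, y, k1=2)
row = np.array([1.0, np.nan])           # second attribute is missing
print(pknn_impute(row, X, y, p, k2=2))  # fills the gap with 9.5
```

Filtering by purity keeps boundary instances (whose neighbours have mixed labels) out of the reference set, which is the noise-resistance idea behind PkNNI.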
(B) DNNI imputation (Cheng et al. 2020): The computational steps of DNNI are as follows:
(1) Calculate the weight set of the distances between each incomplete instance (using only its non-missing values) and all complete instances; the weight set is defined in Eq. (12), where w_m is the weight of the mth incomplete instance and |X| represents the cardinality of the complete data points used for training. Additionally, y_1 is the first instance of the incomplete data, x_1 is the first instance of the complete data, and p is the weighted-distance parameter.
(2) Apply the weight set W to compute the weighted distance D_ij between x and y, as defined in Eq. (13), where d_ij is the weighted distance between y_i and x_i, w_i is the weight of incomplete instance y_i, y_i is the ith incomplete instance, and x_i is the ith complete instance.
(3) Imputation: the set of weighted distances D_ij is compared against an adaptive threshold to determine the set of nearest neighbors. The distance threshold adjusts the optimal nearest neighborhood for estimating missing values; hence, the method does not set a k value for the nearest neighborhood. That is, if d_ij is less than the threshold, then x_i is added to the set of nearest neighbors. The imputation method uses the adapted threshold to obtain the reference points and then applies a central tendency to estimate the missing values from the set of nearest neighbors. The central tendencies include the average, median, and geometric mean.
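A hedged sketch of the DNNI idea described above: instead of a fixed k, every complete instance within an adaptive distance threshold joins the neighbour set, and a central tendency over that set fills the gap. The threshold rule (half of the mean distance) and the fallback to the single nearest point are illustrative assumptions, not the published Eqs. 12 and 13.

```python
import numpy as np

def dnn_impute(row, complete, threshold_factor=0.5, tendency=np.mean):
    """DNNI sketch: keep every complete instance whose distance to the
    incomplete row (over observed attributes) falls under an adaptive
    threshold, then estimate the missing values with a central tendency
    (mean, median, ...) over that neighbour set."""
    miss = np.isnan(row)
    d = np.linalg.norm(complete[:, ~miss] - row[~miss], axis=1)
    threshold = threshold_factor * d.mean()   # assumed adaptive threshold
    nn = complete[d <= threshold]
    if len(nn) == 0:                          # fallback: single nearest point
        nn = complete[[np.argmin(d)]]
    filled = row.copy()
    filled[miss] = tendency(nn[:, miss], axis=0)
    return filled

complete = np.array([[1.0, 10.0], [1.1, 12.0], [0.9, 11.0], [8.0, 80.0]])
row = np.array([1.0, np.nan])
print(dnn_impute(row, complete))  # averages the three close neighbours -> 11.0
```

Because the far-away instance [8, 80] falls outside the threshold, it never pollutes the estimate; swapping `tendency=np.median` switches the central tendency without changing the neighbour selection.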

Proposed computational procedure
To easily understand the computational procedure from data collection to evaluation, we use Fig. 1 to show a visual flow. The computational procedure includes the important clustering data and data imputation. The five steps are introduced in detail in the following.

Step 1: data collection
This step collected eight datasets to verify whether the proposed method was better than the listed imputation methods. The eight datasets comprise seven medical datasets from the UCI dataset repository and one stroke dataset collected from the International Stroke Trial database (Sandercock et al. 2011), which is a real-world dataset. The stroke dataset had missing values, but the seven UCI datasets had none. Therefore, this study uses different missing degrees to simulate missing values and verify the proposed imputation.

Step 2: data preprocessing
Before the imputation step, the class attribute was converted from numeric to a character (symbol), and we merged multi-column class attributes into the one-column class attribute for the subsequent experiments.
This study chose 5%, 10%, 15%, and 20% MDs in the MAR and MCAR types to verify the performance of the proposed imputation method and compare it with that of the listed imputation methods. A 20% MD in this study is already a very large missing ratio; we use the climate model dataset, with 18 attributes and 540 instances, as an example, as shown in Table 3. Setting the MD to 20% yields 1944 missing values (0.2 = 1944/(18 × 540)). Under the MD definition of previous studies, the same case gives an MD of 1944/540 = 360%. This study used this larger MD (20%) to simulate the experiment in the MAR and MCAR types; the main difference is that this study considered the number of attributes. After the simulation of the different MDs, each dataset at each MD was partitioned into a complete and an incomplete sub-dataset. Hence, each dataset had 16 sub-datasets: the 5%, 10%, 15%, and 20% complete sub-datasets and the 5%, 10%, 15%, and 20% incomplete sub-datasets in the MAR and MCAR types.
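The missing-degree bookkeeping in the climate model example can be checked with a few lines; the figures (18 attributes, 540 instances, 20% MD) are taken from the text.

```python
# Missing-degree bookkeeping for the climate model dataset example.
attributes, instances = 18, 540
md = 0.20                             # this paper's MD: fraction of all cells

missing_cells = md * attributes * instances
print(missing_cells)                  # 1944.0 missing values

# The traditional per-instance definition applied to the same cell count:
traditional_md = missing_cells / instances
print(f"{traditional_md:.0%}")        # 360%
```

The same 1944 missing cells thus read as a modest 20% under the cell-based definition but as 360% under the instance-based one, which is why the paper states its MD definition explicitly.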

Step 3: data clustering

From Step 2, this step employed k-means to cluster the 5%, 10%, 15%, and 20% complete sub-datasets in the MAR and MCAR types for each dataset. Clustering means that data in the same cluster may have some relationship with one another.
To find the best number of clusters, this step used two well-known methods to determine the best k. One was the elbow method (Ketchen and Shook 1996), which plots the within-cluster SSE against k and selects the k beyond which adding clusters yields little further reduction (Fig. 2). The other was the average silhouette method (Rousseeuw 1987); the silhouette coefficient measures the clustering effect on the basis of the cohesion and dispersion of each data point i, based on Eq. (6). When s(i) was close to 1, the data were properly clustered (Fig. 3).
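As an illustration of the elbow heuristic, the sketch below runs a tiny hand-rolled Lloyd's k-means on two synthetic blobs and reports the within-cluster SSE for several k. The data, seed, and iteration count are arbitrary assumptions for the demo, not settings from this study.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def kmeans_sse(X, k, iters=20):
    """Tiny k-means (Lloyd's algorithm) returning the within-cluster SSE,
    the quantity the elbow method plots against k."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)          # assign each point to nearest center
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return (d.min(axis=1) ** 2).sum()

# Two well-separated blobs: the SSE drops sharply from k=1 to k=2 and only
# marginally afterwards, so the "elbow" sits at k=2.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
for k in (1, 2, 3):
    print(k, round(kmeans_sse(X, k), 1))
```

In practice one would plot these SSE values and pick the k at the bend; the average silhouette method then serves as an independent cross-check on the same candidate k.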

Step 4: data imputation
In this step, all incomplete sub-datasets in the MAR, MCAR, and MNAR types for each dataset were imputed by the proposed method and the listed imputation methods. From Step 3, after the optimal clustering, the complete sub-datasets were divided into three clusters, five clusters, and the optimal number of clusters, because this study wanted to confirm whether the optimal k indeed gave the best performance; the numbers of clusters compared were odd (3, 5, etc.). Next, this step used the computational steps of PkNNI (Eqs. 9-11) and DNNI (Eqs. 12 and 13) in Sect. 3.1 to estimate all incomplete MAR and MCAR sub-datasets.

Step 5: classification and evaluation
After imputation, accuracy, AUC, and RMSE were applied to evaluate the performance of the proposed imputation. This study used the area under the receiver operating characteristic curve (AUC) because the receiver operating characteristic (ROC) curve captures diagnostic ability on imbalanced classes, trading off the positive detection rate against the false alarm rate. Moreover, the AUC measures the discriminative strength between these two rates without considering misclassification costs or class prior probabilities (Moayedikia et al. 2017).
First, this step used the C4.5, MLP, NB, BN, and LibSVM classifiers to evaluate accuracy and AUC. The classification results were calculated using the confusion matrix (Sammut and Webb 2010); accuracy is defined as follows:

$$\text{accuracy} = \frac{tp + tn}{tp + fp + fn + tn},$$

where tp is the number of true positives, fp the number of false positives, fn the number of false negatives, and tn the number of true negatives. After imputation, this study used the eight datasets to evaluate accuracy, AUC, and RMSE. We used accuracy for datasets with balanced classes; if a dataset had imbalanced classes, this study applied AUC to measure performance.
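The accuracy formula translates directly into code; the confusion-matrix counts below are made-up numbers for illustration.

```python
def accuracy(tp, fp, fn, tn):
    """Accuracy from the confusion matrix: correct predictions (tp + tn)
    over all predictions (tp + fp + fn + tn)."""
    return (tp + tn) / (tp + fp + fn + tn)

# 90 true positives, 5 false positives, 10 false negatives, 895 true negatives.
print(accuracy(tp=90, fp=5, fn=10, tn=895))  # 0.985
```

The example also hints at why AUC is preferred for imbalanced classes: with 905 negatives and only 100 positives, a classifier that misses many positives can still report high accuracy.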
Second, this study employed RMSE as an evaluation criterion to check whether each imputation method was biased. RMSE measures the bias between the estimated values and the real values; its formula is defined as follows:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} e_i^2},$$

where e_i, i = 1, 2, ..., n, denotes the bias between the estimated and real values.
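Likewise, a minimal RMSE helper; the sample values are arbitrary.

```python
import math

def rmse(estimated, real):
    """RMSE: square root of the mean squared bias e_i between
    estimated and real values."""
    errors = [(e - r) ** 2 for e, r in zip(estimated, real)]
    return math.sqrt(sum(errors) / len(errors))

print(rmse([2.0, 4.0, 6.0], [1.0, 4.0, 8.0]))  # sqrt((1 + 0 + 4) / 3)
```

In the imputation setting, `estimated` holds the imputed values and `real` the values that were masked out, so a lower RMSE means less biased imputation.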

Experiments and results
This section presents the experimental environment, datasets, and experimental results.
After imputation, this study implemented data classification and evaluated the performance of the proposed and listed imputation methods; the parameter settings of the classifiers are shown in Table 2. In the experiment, this study selected eight datasets: seven medical datasets from the UCI Machine Learning Repository and one real-world stroke dataset. The stroke dataset from the IST (Sandercock et al. 2011) contains data collected between 1991 and 1996, with a pilot phase between 1991 and 1993, and it has 100% baseline data and over 99% complete follow-up data. It has 19,436 instances and 112 feature attributes for patients with acute stroke. After screening and removing the irrelevant attributes, the stroke dataset had 39 attributes and 4241 instances for this study.
We employed the eight datasets to verify whether the proposed imputation method was better than the listed imputations. The numbers of classes, attributes, and instances, and the class imbalance, of the UCI medical and stroke datasets are listed in Table 3. From Table 3, we see that all of the selected datasets have imbalanced classes, except the stroke dataset. Therefore, the AUC metric is used in the experiments on the seven UCI datasets.

MAR and MCAR experiments
This experiment was divided into an internal and an external experiment; the internal experiment verified the performance of the proposed imputation method for different cluster numbers, MDs, and missing value types. After imputation, we employed the C4.5, MLP, NB, BN, and LibSVM classifiers to compare the accuracy of the best number of clusters with three and five clusters. The different-imputation comparison compared the proposed imputation with the listed imputation methods for different cluster numbers, MDs, and missing value types. In this experiment, the stroke dataset itself had missing values belonging to the MNAR type, while the seven UCI datasets had none. Therefore, this study used 5%, 10%, 15%, and 20% MDs to simulate missing values; "best k" denotes the optimal cluster number based on the elbow and average silhouette methods.

Internal comparisons
This section applies the elbow and average silhouette methods to obtain a consistent optimal cluster number and compares PkNNI and DNNI for two (the best number of clusters), three, and five clusters across different MDs and missing value types. After implementing the elbow and average silhouette methods, the best number of clusters for the collected eight datasets was two. Here, we only show the results for the liver disorders and ILPD datasets in the MAR and MCAR types; the five other datasets are listed in Appendix 1.

Liver disorders dataset
First, we implemented the elbow and average silhouette methods to obtain the best number of clusters, as shown in Figs. 2 and 3. As shown there, two clusters was the consistent best number of clusters for the liver disorders dataset. We then applied the best number of clusters (two), three clusters, and five clusters, combined with PkNNI and DNNI, to impute the missing values in the MAR and MCAR types. After imputation, the C4.5, MLP, NB, BN, and LibSVM classifiers were employed to compute and compare their performance. Table 4 shows that the three clusters combined with PkNNI had the best average AUC for all MDs for MAR-type missing values.

Fig. 4 Average AUC of proposed method for different MDs and cluster number in MAR (liver disorders)
Figure 4 compares the no-clustering results with the clustering results: the three clusters combined with PkNNI had the best AUC, and the five clusters combined with PkNNI had the worst result across the different MDs. Table 5 shows that the three clusters combined with PkNNI had the best result for all MDs in MCAR. Figure 5 shows that, compared with no clustering, the three clusters combined with PkNNI had the best AUC at the 5% and 15% MDs, while no clustering combined with PkNNI had the better AUC at the 10% and 20% MDs. The best number of clusters combined with PkNNI gave the worst result at all MDs.

ILPD dataset
The best number of clusters from the elbow and average silhouette methods was two for the ILPD dataset. Next, we used two (the best number of clusters), three, and five clusters combined with PkNNI and DNNI to impute the missing values in the MAR and MCAR types. After imputation, the C4.5, MLP, NB, BN, and LibSVM classifiers were applied to calculate and compare their AUC. Table 6 shows that the three clusters combined with DNNI had the best AUC for the four MDs in the MAR type. Figure 6 indicates that the three clusters combined with DNNI had the best average AUC at the 5%, 10%, and 15% MDs.
In MCAR, Table 7 shows that the three clusters combined with PkNNI had a better result for the 10%, 15%, and 20% MDs, and Fig. 7 presents the corresponding comparison across MDs.
In MAR, each dataset has 24 average AUC values (4 different MDs × 3 different cluster settings × 2 imputation methods). The comparison counts the number of wins at the same MD; because the average AUC of the banknote and acute datasets could not distinguish their differences, only 30 comparison results were made. The best k clusters won 11 times (including eight ties), the three clusters won 12 times (including six ties), and the five clusters won four times (including three ties) across the four MDs of the five datasets, as shown in Table 13 of Appendix. Thus, the proposed ''three clusters + PkNNI'' had the better AUC for the seven UCI datasets. In MCAR, the comparison yielded 42 results (3 different cluster settings × 2 imputation methods × 7 datasets); the best k clusters won 17 times (including 10 ties), the three clusters won 20 times (including 12 ties), and the five clusters won 13 times (including 12 ties) across the four MDs of the seven datasets, as shown in Table 14 of Appendix. Thus, the proposed ''three clusters + PkNNI'' again performed best for the seven UCI datasets.

Different imputation comparisons
After the internal experiment, most datasets showed that no clustering had the worst result. Hence, this section presents the external experiments, which compare the proposed imputation with the listed imputation methods under different cluster numbers, MDs, and missing types.

AUC metric
The results for the seven open UCI datasets in MAR and MCAR are shown in Tables 8 and 9. From the experimental results, we summarize the findings for the seven open UCI datasets as follows.
In MAR, the proposed imputation is better than the listed imputation methods for the seven UCI datasets at all MDs, as shown in Table 8. In addition, the proposed clustering combined with PkNNI is slightly better than clustering combined with DNNI. In MCAR, similarly, the proposed method has a better average AUC than the listed imputation methods for the seven UCI datasets at all MDs, as shown in Table 9. Again, the proposed clustering combined with PkNNI is better than clustering combined with DNNI.

RMSE metric
We also computed the average RMSE of each imputation method in MAR and MCAR, where the average RMSE was the average over the ten imputed datasets for each MD. The minimal RMSE of the seven imputation methods in MAR and MCAR is shown in Table 10. In MAR, clustering combined with PkNNI imputation had the best performance (minimal average RMSE) among the seven imputation methods, as shown in Table 10 and Fig. 8. In MCAR, clustering combined with DNNI imputation had the minimal average RMSE among the seven imputation methods, as shown in Table 10 and Fig. 9.
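As a concrete illustration of the metric (a hypothetical helper, not the paper's code): because the deleted cells are known in the MAR/MCAR simulations, the RMSE is computed only over those cells and then averaged over the ten imputed datasets per MD:

```python
import numpy as np

def imputation_rmse(X_true, X_imp, mask):
    """RMSE restricted to the artificially deleted cells (mask=True),
    i.e., imputed values scored against the known ground truth."""
    diff = (X_imp - X_true)[mask]
    return float(np.sqrt((diff ** 2).mean()))

X_true = np.array([[1., 2.], [3., 4.], [5., 6.]])
mask = np.zeros_like(X_true, dtype=bool)
mask[0, 1] = mask[2, 0] = True                 # the two deleted cells
X_imp = X_true.copy()
X_imp[0, 1], X_imp[2, 0] = 2.5, 4.0            # imputed values
print(imputation_rmse(X_true, X_imp, mask))    # sqrt((0.5^2 + 1^2)/2) ≈ 0.7906
```

Restricting the error to the masked cells matters: including the untouched cells (which match exactly) would dilute the error and make all methods look artificially close.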

MNAR experiment
MNAR refers to patients who are not willing to provide the relevant data to the physician due to their privacy, so the missing values depend on unobserved patient attributes. The stroke dataset itself had missing values that belonged to the MNAR type. This study directly clustered the complete dataset and imputed the incomplete dataset, and then classified the imputed dataset with the five classifiers to verify the proposed imputation. From Table 3, the stroke dataset has balanced classes, and this study uses classification accuracy and AUC to measure the comparison results.
In the internal comparison, the results of the stroke dataset are shown in Table 11, indicating that the three clusters combined with DNNI had the best average accuracy. Across no clustering and the different cluster numbers, Fig. 10 shows that DNNI was better than PkNNI in the average accuracy of the five classifiers, and clustering imputation was better than no-clustering imputation. In the external comparison, Table 12 shows that the proposed DNNI imputation had the best average accuracy and AUC among the seven imputation methods.

Discussions and findings
After the MAR, MCAR, and MNAR experiments, we provide some discussions and findings as follows.

Cluster numbers effect
After determining the number of clusters by the elbow and average silhouette methods, the best number of clusters in the collected eight datasets is two. However, from the internal and different imputation comparisons, none of the best AUC and RMSE results belongs to two clusters. Figures 4, 5, 6, 7, and 10 show that clustering combined with PkNNI and DNNI can obtain better AUC than no clustering for the collected datasets. Overall, the best combination of the proposed imputation is three clusters with PkNNI, because the different datasets have different numbers of classes and data properties, and because of the disadvantages of k-means. The disadvantages of k-means clustering are (1) selecting an appropriate k, (2) dependence on initial values, (3) bias caused by outliers, and (4) the assumption that all variables have the same variance. Similarly, FCM-based imputation also has some disadvantages. For example, OCS (Hathaway and Bezdek 2001; Al et al. 2013) used initial values and available feature values to calculate FCM cluster prototypes, and the missing values were estimated based on these biased cluster prototypes. Hence, the computation of the FCM cluster prototypes and the imputation of the missing values influence each other. In addition, Li et al. (2010) developed a fuzzy clustering algorithm based on the nearest neighbor interval (FCM-NNI) to enhance the performance. Therefore, we suggest that FCM-based imputation can incorporate such methods to enhance its performance.
Fig. 9 Minimal RMSE of seven imputation methods in MCAR. blood*, Climate*, Haberman*, and ILPD* denote that the RMSE of these four datasets is divided by 10 (RMSE/10) to plot the figure because they have a large RMSE

MNAR imputation
MNAR missing values are a non-ignorable non-response, and the data are neither MAR nor MCAR. That is, the missing values of a variable have their own reason rooted in privacy (Polit and Beck 2012). The stroke dataset was collected by a questionnaire of stroke patients and had missing values that belonged to the MNAR type. In MNAR, we could not calculate the RMSE because the missing values have no actual values, so this study only used accuracy and AUC to evaluate the performance. After the experiments, the proposed imputation (three clusters + DNNI) had the best average accuracy and AUC among the seven imputation methods on the stroke dataset. The stroke dataset has many categorical attributes, yet the proposed imputation still performed best, indicating that the proposed imputation method is viable. In this experiment, the best performance came from three clusters combined with DNNI, because combining the best-fitting number of clusters with the imputation method is an important factor.

Multiple-combined imputation methods
In recent years, many hybrid imputation methods have appeared. Zhang et al. (2017) proposed a method that flexibly combines three techniques: self-organizing feature map clustering, the fruit fly optimization algorithm, and the least squares support vector machine to impute spatiotemporal missing values. Dubey and Rasool (2020) combined k-means clustering with weighted kNN to impute missing values, and their results outperformed mean substitution and FCM imputation. This study combined k-means clustering with PkNNI and DNNI imputation, and the experimental results showed that the proposed imputation was better than the listed imputation methods on the eight datasets. We find that combining the local similarity structure of the dataset (using k-means, self-organizing map, or FCM clustering) with an imputation method can enhance the performance. Therefore, we suggest that the FCM, MI, MICE, and kNN imputation methods can be augmented with such techniques to improve imputation performance.
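The general pattern shared by these hybrids, i.e., cluster the complete rows, assign each incomplete row to the nearest cluster on its observed attributes, and impute from the k nearest complete neighbours within that cluster, can be sketched as follows. This is a simplified mean-of-neighbours rule for illustration, not the exact PkNNI or DNNI formulas, and `cluster_knn_impute` is a hypothetical name:

```python
import numpy as np

def cluster_knn_impute(X, k_clusters=3, k_nn=3, seed=0):
    """Cluster-then-impute sketch: k-means on complete rows, then fill
    each missing cell with the mean of the k nearest complete neighbours
    inside the incomplete row's nearest cluster."""
    complete = ~np.isnan(X).any(1)
    Xc = X[complete]
    # --- plain k-means on the complete rows ---
    g = np.random.default_rng(seed)
    C = Xc[g.choice(len(Xc), k_clusters, replace=False)]
    for _ in range(50):
        lab = ((Xc[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        C = np.array([Xc[lab == j].mean(0) if np.any(lab == j) else C[j]
                      for j in range(k_clusters)])
    lab = ((Xc[:, None] - C[None]) ** 2).sum(-1).argmin(1)
    # --- impute each incomplete row from its nearest cluster ---
    X_imp = X.copy()
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])
        c = ((C[:, obs] - X[i, obs]) ** 2).sum(1).argmin()   # nearest centroid
        pool = Xc[lab == c] if np.any(lab == c) else Xc
        d = ((pool[:, obs] - X[i, obs]) ** 2).sum(1)
        nn = pool[np.argsort(d)[:k_nn]]                      # k nearest neighbours
        X_imp[i, ~obs] = nn[:, ~obs].mean(0)                 # mean of neighbours
    return X_imp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (20, 2)),   # low blob
               rng.normal(10.0, 0.2, (20, 2)),  # high blob
               [[np.nan, 10.1]]])               # incomplete row near the high blob
X_imp = cluster_knn_impute(X, k_clusters=2, k_nn=3)
print(X_imp[-1])  # the missing cell is filled near 10
```

The clustering step restricts the neighbour search to locally similar rows, which is the shared reason these hybrids beat global mean substitution and plain FCM imputation in the cited studies.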

Conclusion
This study has proposed clustering-based purity and distance imputation methods to improve performance. After the MAR, MCAR, and MNAR experiments, the proposed imputation was found to be better than the other imputation methods on the seven UCI datasets and a stroke dataset, except for the acute inflammations dataset. In the RMSE of the MAR and MCAR experiments, clustering combined with PkNNI (DNNI) imputation had the minimal average RMSE among the seven imputation methods. The proposed imputation was better than the listed imputation methods mainly because clustering aggregates similar data points in one cluster. To obtain the optimal number of clusters, we applied the elbow and average silhouette methods to obtain a consistent optimal number, and the best cluster number was two for the eight datasets. The number of classes of the seven datasets was two, except for the acute inflammations dataset with four classes, as shown in Table 3, and the acute inflammations dataset has almost only categorical attributes. From the results and findings, we summarize the contribution and applicability of the proposed method as follows:
(1) The proposed imputation is a hybrid method, and the attribute data need not obey a multivariate normal distribution; i.e., the proposed approach combines k-means with the two kNN-related approaches. The main difference between the two approaches (PkNNI and DNNI) and the proposed method is finding the optimal number of clusters and adapting the two kNN-related approaches to obtain the optimal results for the eight medical datasets: PkNNI with two parameters needs to be trained for optimal results, and DNNI with different averages and thresholds must be adapted to obtain the optimal results.
(2) Medical data have higher MDs; hence, research works need to check for a higher percentage of complete data (e.g., normally more than 50% of the total cases) and need to impute fewer than three values per patient (Kharrazi et al. 2014). However, this study simulated up to 3.60 missing values per patient (MD = 20% in this paper) to handle missing values. In particular, the practical stroke dataset has 69 attributes with 4242 instances and 6578 missing values; hence, the traditional MD is 1.55 missing values per patient (MD = 2.25% in this paper).
In addition, the largest MD of a single attribute was 95%, and up to 14 values per patient needed to be imputed.
(3) In selecting the imputation method, Sim et al. (2015) suggested understanding the characteristics of the dataset (especially the patterns of missing values) to obtain better performance. This study proposed a hybrid imputation to handle missing values in the MAR, MCAR, and MNAR types; each imputation method has its advantages and disadvantages, so a set of optimal combinations may be derived from the estimated results.
(4) How to evaluate imputation performance is an important issue. Most previous studies used accuracy and RMSE, but medical data usually have imbalanced classes; this paper applied accuracy, RMSE, and AUC to overcome the problem of imbalanced classes.
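To make point (4) concrete: AUC can be computed directly from classifier scores as the Mann–Whitney statistic, the probability that a randomly chosen positive is ranked above a randomly chosen negative, which is invariant to the class ratio. A minimal sketch (illustrative, not the paper's evaluation code):

```python
import numpy as np

def auc(y_true, scores):
    """AUC via the Mann-Whitney statistic: P(score of a random positive
    > score of a random negative), with ties counted as half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([0, 0, 0, 0, 1])            # heavily imbalanced: one positive
s = np.array([.1, .2, .3, .9, .8])
print(auc(y, s))  # 0.75: the positive outranks 3 of the 4 negatives
```

On the same data, a classifier predicting the majority class everywhere scores 80% accuracy while its AUC is 0.5, which is why AUC is the fairer yardstick for the imbalanced medical datasets here.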
In future work, we can utilize frequency-based or neighbor-count imputation to improve the performance on categorical datasets and combine it with other model-free imputation methods to conduct massive experiments.

Appendix
See Tables 13 and 14.

Declarations
Conflict of interest The authors declare that they have no conflicts of interest.
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.