A Review of Current Publications Trend on Missing Data Imputation Over Three Decades: Direction and Future Research

Missing values, also referred to as missing data, are an unavoidable issue in data collection. They are uncontrollable and occur in almost every research field. Hence, this study identifies the current publication trend on missing data imputation techniques (1991-2021), specifically in classification problems, using bibliometric analysis. Most importantly, this research aims to uncover potential missing data imputation methods. Two software tools were used: VOSViewer and Harzing Publish or Perish. Based on data extracted from the Scopus database in June 2021, the findings indicate an emerging trend in missing data imputation research to date, with two imputation methods receiving the most attention: the random forest and the nearest neighbor methods.


Introduction
The missing data problem is ubiquitous for researchers analyzing real-world data. Real-world data usually contain many errors, such as incomplete data, inconsistent formats (discrepancies in codes), missing patterns and sometimes outliers. Consequently, data scientists and researchers often spend much of their time on data preprocessing. Data preprocessing has been indicated by researchers as a fundamental stage in machine learning (ML) methods (1). Many classification models are incapable of handling missing values directly, so the missing values must be dealt with before any ML algorithm is applied. As a result, dealing with missing values in the data preprocessing step remains an important part of the classification process prior to estimation.
A well-known technique, listwise deletion (or complete case analysis), has been used extensively to handle missing values during data preprocessing (2) (3). Ignoring or deleting instances with missing data is common practice in some fields (4) (5). However, it discards the valuable information contained in the incomplete instances and decreases statistical power as the sample size is reduced (6)(7). Lin et al. (5) experimented with the case deletion technique and concluded that it can be used when the missing rate is small, in which case its performance is comparable to that of imputation techniques. This, however, depends on the data type (categorical, numerical or mixed-type), the missing mechanism and the number of attributes or classes. Case deletion performs remarkably well on numerical datasets with missing rates of up to 20%, while on mixed-type datasets the missing rate is up to 17%. A severe effect occurs when missing data are substituted with zero or null values, which biases predictions and interferes with decision making (8).
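The trade-off described above can be illustrated with a minimal Python sketch. The records, attribute names and helper functions below are hypothetical and serve only to show how listwise deletion shrinks the sample and how zero substitution shifts a simple statistic; they are not taken from any of the cited studies.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 34, "income": 52.0},
    {"age": None, "income": 61.0},
    {"age": 45, "income": None},
    {"age": 29, "income": 48.0},
    {"age": 51, "income": 75.0},
]


def listwise_delete(records):
    """Keep only records in which every attribute is observed."""
    return [r for r in records if all(v is not None for v in r.values())]


def fill_with_zero(records):
    """Replace every missing value with 0 (shown only to expose the bias)."""
    return [{k: (0 if v is None else v) for k, v in r.items()} for r in records]


complete = listwise_delete(rows)      # two of the five records are dropped
zero_filled = fill_with_zero(rows)

# Mean income over the observed values versus after zero substitution.
mean_observed = mean(r["income"] for r in rows if r["income"] is not None)
mean_zero_filled = mean(r["income"] for r in zero_filled)

print(len(rows), len(complete))
print(mean_observed, mean_zero_filled)
```

In this toy case the complete-case sample falls from five records to three, and the zero-filled mean income is pulled below the mean of the observed incomes, which is the kind of bias the paragraph above warns about.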
In contrast, missing data imputation techniques replace missing values with artificial estimates while maintaining data completeness (9). Over the past three decades, a multitude of imputation approaches have been studied, ranging from statistical procedures to machine learning algorithms. The statistical procedures include mean (10), mode and median (11), linear interpolation (12) and regression (13), while the machine learning methods include K-nearest neighbors (14) (15), fuzzy c-means (16), random forest (17), neural networks (18) and decision trees (19). However, there is no firm conclusion on which imputation model is best, because this depends on the type of data, the missing proportion and the missing data mechanism (5). The missing data mechanism is one of missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR) (20). MCAR means that every value in each attribute has an equal chance of being missing, for example due to technical errors such as machine breakdown or system failure. Under MAR, the probability that an attribute is missing depends on the observed information but not on the missing values of that attribute. MNAR occurs when the probability that an attribute is missing depends on the missing values of that attribute itself. Researchers usually assume that the missing data are either MCAR or MAR, as MNAR is difficult to identify. Before employing any imputation method, it is advisable to identify whether the missing mechanism is MCAR, MAR or MNAR.
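To make the three mechanisms concrete, the following Python sketch generates missingness masks for a toy attribute. It is an illustration under simplifying assumptions: the function name `make_masks`, the toy data and the deterministic "largest values go missing" rule for MAR and MNAR are hypothetical choices, whereas in practice the dependence is usually probabilistic.

```python
import random


def make_masks(x, y, rate=0.3, seed=0):
    """Return three boolean masks marking which entries of y go missing.

    x is a fully observed attribute; y is the attribute that loses values.
    """
    rng = random.Random(seed)
    n = len(y)
    k = round(rate * n)  # number of entries removed under each mechanism

    def as_mask(indices):
        return [i in indices for i in range(n)]

    # MCAR: every entry of y has the same chance of being missing.
    mcar_idx = set(rng.sample(range(n), k))

    # MAR: missingness in y depends only on the observed attribute x
    # (simplification: the k records with the largest x lose their y value).
    mar_idx = set(sorted(range(n), key=lambda i: x[i], reverse=True)[:k])

    # MNAR: missingness in y depends on the value of y itself
    # (simplification: the k largest y values are the ones that go missing).
    mnar_idx = set(sorted(range(n), key=lambda i: y[i], reverse=True)[:k])

    return as_mask(mcar_idx), as_mask(mar_idx), as_mask(mnar_idx)


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5, 3, 9, 1, 7, 2, 8, 4, 10, 6]
mcar, mar, mnar = make_masks(x, y)
print(mcar, mar, mnar, sep="\n")
```

Under MAR the mask can be reproduced from the observed attribute `x` alone, whereas under MNAR it can only be explained by the values of `y` that are no longer available, which is why MNAR is hard to identify from the data at hand.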
Despite growing interest in missing data imputation techniques, to the best of the author's knowledge there have been surprisingly few attempts to report the trend of prior works, particularly using a bibliometric approach. Only Adnan (21) used bibliometric analysis to study missing data, covering 60 years (1960-2019) of research history, but that analysis addressed general information (publication growth, document language, subject area, and country of focus).
Hence, this study extends the research by Adnan (21) by focusing on missing value imputation in classification problems. Moreover, this study aims to reveal the most impactful authors, the most impactful publications and the potential research gaps in missing data imputation methods.
The next section discusses in detail the data source and the methodology used in the bibliometric analysis. Then, the analytical results are presented in the form of graphs and tables, together with visualizations of the interconnections between keywords, authorship, and citations. The discussion and conclusion are presented in the last section.

Data Sources and Preprocessing
This study employed the Scopus database as the basis for extracting prior works on missing value-related matters. On 7th June 2021, a search was conducted with the keywords "missing data" or "missing values" or "missing value" or "incomplete data" and "imputation" and classification. To focus on relevant literature on missing data imputation, the search was applied to article titles, abstracts and author keywords, and it returned 779 related papers. The result was then refined to include only journal articles published from 1991 to 2021. Finally, after screening, a total of 430 published journal articles were selected and included in the study.
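The search described above might be expressed as a Scopus advanced-search string along the following lines. This is a hedged reconstruction, not the query actually used: the exact string is not reported, and `TITLE-ABS-KEY` also covers indexed keywords, which is slightly broader than the title, abstract and author keywords mentioned in the text. The helper `any_of` is a hypothetical convenience function.

```python
def any_of(terms):
    """Join quoted phrases with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


missing_terms = ["missing data", "missing values", "missing value", "incomplete data"]

# Assumption: TITLE-ABS-KEY stands in for the title/abstract/keyword search.
topic = (
    f"TITLE-ABS-KEY({any_of(missing_terms)} "
    'AND "imputation" AND "classification")'
)

# Refinement described in the text: journal articles published 1991-2021.
refinement = "DOCTYPE(ar) AND SRCTYPE(j) AND PUBYEAR > 1990 AND PUBYEAR < 2022"

query = f"{topic} AND {refinement}"
print(query)
```

The final manual screening step that reduced the result set to 430 articles cannot be captured in a query string and is not represented here.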

Bibliometric Analysis
A bibliometric study, also known as a scientometric study, uses mathematical and statistical tools to quantify and discover trends in published materials. This study commences with a descriptive summary of the published documents, tabulated and graphed by year, subject area, author, country, and document language. Next, an extensive bibliometric analysis comprising citation, authorship and keyword analysis was carried out using the VOSViewer and Harzing Publish or Perish software. Harzing Publish or Perish was used to compute citation metrics such as the total citations, the average citations per year and the average number of authors per document, as well as to reveal the most impactful researchers. VOSViewer was used to visualize the interconnections among authors, documents and the keywords used by various authors.
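The descriptive citation metrics named above can be sketched in a few lines of Python. The `Paper` record, the toy values and the definition of citations per year (total citations divided by the number of years from the earliest paper to the census year, inclusive) are assumptions made for illustration; Publish or Perish may define its metrics differently.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    year: int
    citations: int
    n_authors: int


def citation_metrics(papers, census_year):
    """Summarise a set of papers with simple descriptive citation metrics."""
    total = sum(p.citations for p in papers)
    first_year = min(p.year for p in papers)
    span = census_year - first_year + 1  # years covered, inclusive (assumption)
    return {
        "papers": len(papers),
        "total_citations": total,
        "citations_per_year": total / span,
        "citations_per_paper": total / len(papers),
        "authors_per_paper": sum(p.n_authors for p in papers) / len(papers),
    }


# Hypothetical records, not taken from the Scopus extract.
sample = [
    Paper("A", 2011, 40, 2),
    Paper("B", 2016, 12, 3),
    Paper("C", 2020, 8, 4),
]
metrics = citation_metrics(sample, census_year=2020)
print(metrics)
```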

Analysis And Findings
The analysis of the extracted documents is divided into two phases: descriptive analysis, and analysis of keywords, citations and authorship. The results also reveal the top 20 most cited articles on missing data related issues up to June 2021.

Publication Growth
Based on the Scopus database, the first journal article on missing data imputation in classification problems was published in 1991 by Clogg, Rubin, Schenker, Schultz, and Weidman (22), who studied multiple imputation based on Bayesian logistic regression to generate a new database (Tab. 1). It is also listed among the top 20 most cited articles (Tab. 6), with 118 citations. Fig. 1 shows slow growth in related publications from 1991 until 2005, with a maximum of three publications per year. Following that, there was a notable increase to eight publications in 2006 and a gradual increase since then. A possible reason is that publications on missing values were kick-started by the popularization of the data mining field (23). Overcoming missing values is important because they are a major drawback in data analysis. The highest number of publications was in 2020, with 63 articles, equivalent to 14.65% (Tab. 1). The number of publications is anticipated to increase significantly in 2021, as there were already 36 papers by June 2021, when the data for this article were extracted. Furthermore, an overall citation count of 9605 (Tab. 5) confirms the relevance of the topic.
*Some documents are classified in more than one subject area.
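The percentage reported for 2020 follows directly from the counts given in the text, as the short Python check below shows. The variable names are illustrative, and the share for the partial year 2021 is computed here only as a by-product of the same formula.

```python
total_articles = 430   # journal articles retained after screening
articles_2020 = 63     # articles published in 2020 (Tab. 1)
articles_2021 = 36     # articles indexed by June 2021


def share(count, total=total_articles):
    """Percentage of the full corpus accounted for by `count` articles."""
    return 100 * count / total


share_2020 = share(articles_2020)
share_2021 = share(articles_2021)
print(f"{share_2020:.2f}% {share_2021:.2f}%")  # 14.65% 8.37%
```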

Country Productivity
Researchers from 56 countries have shown interest in the study of missing data across various fields. More and more countries have begun to devote themselves to research related to missing data, such as determining techniques for replacing missing values (23)(24), understanding patterns of missing data (25)(26), and evaluating the effect of missing value imputation on classification accuracy (27).

Documents by Author
According to the Scopus database, the top ten authors of missing data related publications are shown in Fig. 4. Twala, B. recorded the highest contribution with 8 articles, followed by Garcia-Laencina, P. J. and Sancho-Gomez, J. L. with 6 articles each. The contributions by Twala, B. include the use of neural networks in dealing with class imbalance and missing data problems (28), classification and regression trees for missing data with high attribute correlations (29), and k-nearest neighbor (KNN) and support vector machines for missing data of higher complexity with a limited number of instances (30). The second-ranked author, Garcia-Laencina, P. J., together with co-author Sancho-Gomez, J. L., proposed a novel KNN imputation with a feature-weighted distance metric based on mutual information (MI) for solving classification tasks (15). In separate research, he presented new public software for missing data imputation, called Web IMPutation, which is linked to a computer cluster to perform computationally demanding tasks. The software is free, and registered users can create, run, analyze and save simulations related to missing data imputation (31).

Keywords Analysis
A keyword analysis was performed using VOSViewer to evaluate the specific topics debated in the missing data related publications. The analysis reveals that 3835 keywords were used within the papers. The minimum number of keyword occurrences was set to 8, resulting in 135 selected keywords (items). Fig. 5 reveals three clusters, which can be grouped according to research area: computer science with 58 items (red cluster), medicine with 42 items (green cluster), and mathematics/statistics with 35 items (blue cluster). This result is in line with section 3.1.2, where computer science, mathematics/statistics and medicine dominated. The size of each node varies according to the importance of the element. For example, in Fig. 5 the keywords classification, missing data, imputation, classification (of information), and support vector machine have large circles, meaning that these topics are the most discussed, with the highest occurrence.
In contrast, a smaller circle reflects a lower frequency of occurrence, as for the keyword genetic algorithm. Each keyword is linked to other keywords. For instance, in Fig. 6 the keyword imputation is linked with data mining, nearest neighbor search, neural networks, classification accuracy, learning systems, feature extraction, classification (of information), imputation methods, missing values, incomplete data, missing value imputations, optimization, cluster analysis, algorithms, data analysis, humans, article, priority journal, female, adult, middle age and aged. A link shows that the two topics are discussed together. The distance between two keywords indicates their relatedness: the shorter the distance, the stronger the relatedness.
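The occurrence threshold and the links between keywords can be approximated with a simple counting procedure, sketched below in Python. This is not VOSViewer's implementation: it only counts raw occurrences and co-occurrences, and it omits the normalization, clustering and layout steps that produce the maps. The function name and the toy keyword lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def keyword_network(papers, min_occurrences):
    """Count keyword occurrences and co-occurrence links across papers.

    papers is a list of keyword lists, one list per paper.
    """
    # Each keyword counts once per paper, ignoring case and repeats.
    cleaned = [sorted({k.strip().lower() for k in kws}) for kws in papers]
    occurrences = Counter(k for kws in cleaned for k in kws)

    # Keep only keywords that reach the occurrence threshold.
    kept = {k for k, n in occurrences.items() if n >= min_occurrences}

    # Link strength: number of papers in which both keywords appear.
    links = Counter()
    for kws in cleaned:
        for a, b in combinations([k for k in kws if k in kept], 2):
            links[(a, b)] += 1
    return {k: occurrences[k] for k in kept}, links


toy_papers = [
    ["Missing data", "Imputation", "Classification"],
    ["missing data", "imputation", "random forest"],
    ["Missing data", "classification"],
    ["imputation", "genetic algorithm"],
]
nodes, links = keyword_network(toy_papers, min_occurrences=2)
print(nodes)
print(links)
```

With the study's threshold, `min_occurrences` would be 8; the toy example uses 2 so that the filtering is visible on four papers. Keywords below the threshold, such as "genetic algorithm" here, drop out of the network.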
In the overlay visualization in Fig. 7 and Fig. 8, the hot topics discussed in the research, "missing data", "classification", "article" and "human", turn out to be important.  [23], [27], [34], [45]. It should be noted that multiple imputation is a well-known method in medical research. A summary of the rest of the top 20 articles is presented in Tab. 7.

Citation Analysis by Documents/Articles
VOSViewer was employed to examine the citation analysis by documents in depth. The citation analysis by documents was carried out to measure the citation impact of individual documents and to investigate the expansion of an article. As an illustration, in Fig. 9, on the article "Missforest-Non-parametric missing value imputation for mixed-type data" by Stekhoven

Citation Analysis by Authors
This section studies the impact of authors based on citations. With a minimum of three documents and 100 citations per author, the most impactful authors in the study of missing values are Zhang, S., Zhu, X., Herrera, F., Luengo, J., Twala, B., Li, X., Zhang, Z., Pan, Q., and Garcia-Laencina, P. J. (Tab. 8). The relationships among the authors are shown in Fig. 10, where Garcia-Laencina, P. J., Twala, B., Herrera, F., Luengo, J., and Pan, Q. are in the same cluster, Cluster 1, while Li, X., Zhang, Z., Zhang, S., and Zhu, X. are in Cluster 2.
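The selection rule above (at least three documents and at least 100 citations per author) can be sketched as a small Python filter. The author labels and citation counts below are hypothetical, and the sketch assumes full counting, in which every co-author is credited with all citations of a paper.

```python
from collections import defaultdict


def impactful_authors(papers, min_docs=3, min_citations=100):
    """Return (author, documents, citations) for authors meeting both thresholds."""
    docs = defaultdict(int)
    cites = defaultdict(int)
    for authors, n_citations in papers:
        for author in authors:
            docs[author] += 1
            cites[author] += n_citations  # full counting (assumption)
    selected = [
        (a, docs[a], cites[a])
        for a in docs
        if docs[a] >= min_docs and cites[a] >= min_citations
    ]
    # Most cited authors first.
    return sorted(selected, key=lambda t: t[2], reverse=True)


# Hypothetical (authors, citations) records.
toy = [
    (["A", "B"], 80),
    (["A"], 30),
    (["A", "C"], 10),
    (["B"], 50),
    (["B", "C"], 5),
    (["C"], 20),
]
ranking = impactful_authors(toy)
print(ranking)
```

In the toy data all three authors have three documents, but only two pass the citation threshold, so the third is excluded from the ranking.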
The overlay visualization of citation analysis by authors is shown in Fig. 11. It should be noted that Twala, B. is the most productive author in this research area with 8 publications to date, while Zhang, S. is the most impactful author, having received the highest number of citations (554 citations). The most impactful document is "Missforest-Non-parametric missing value imputation for mixed-type data" by Stekhoven, D. J. (2012), with a total of 1069 citations.
Based on the evolutionary pathway in Fig. 7, this study reveals two promising techniques for missing data imputation: the random forest and the nearest neighbor search (kNN) algorithms. Both methods appear to have the same strengths in dealing with missing values, including mixed-type attributes and the MAR, MCAR, and MNAR missing mechanisms, and both belong to the same category of nonparametric methods. These methods are robust and require no information on the data distribution (5). However, previous researchers did not compare the two methods in missing data imputation (17). Therefore, future research is recommended to compare these two powerful methods and evaluate their performance in missing data imputation.

Overlay visualization of citation analysis by documents
Figure 10: Network visualization of citation analysis by authors
Figure 11: Overlay visualization of citation analysis by authors
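As a starting point for the comparison recommended in the discussion, nearest neighbor imputation can be sketched in plain Python as follows. This is a minimal illustration for numerical attributes only, under the assumption that distances are computed over the attributes observed in both rows and rescaled by the share of attributes compared. It is not the implementation used in any of the cited works, and a random forest imputer such as missForest is not shown.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that attribute over the k nearest rows."""
    n_cols = len(rows[0])

    def distance(a, b):
        # Squared differences over attributes observed in both rows.
        shared = [(u - v) ** 2 for u, v in zip(a, b)
                  if u is not None and v is not None]
        if not shared:
            return math.inf
        # Rescale so that rows compared on few attributes are not favoured.
        return math.sqrt(sum(shared) * n_cols / len(shared))

    imputed = [list(r) for r in rows]  # leave the input untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that observe attribute j.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the gap if no row observes attribute j
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Hypothetical numerical data; None marks a missing value.
data = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.1],
    [5.0, 6.0, 7.0],
    [1.2, 2.2, 2.9],
]
filled = knn_impute(data, k=2)
print(filled[1])
```

In the toy data, the missing value in the second row is filled from the two rows closest to it (the first and the fourth), while the distant third row does not contribute. Mixed-type attributes would need a different distance and a mode-based aggregate for categorical values.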