This section is divided into four subsections. The first introduces the linkage to HIV care dataset, and the second gives an overview of missing data and imputation approaches. The third presents the imputation algorithms and models considered in this study. The fourth describes the simulations performed and defines and discusses the performance measures chosen to evaluate imputation precision and the prediction accuracy of the analysis models.
The Linkage to HIV Care Study Dataset
The data were collected in rural Uganda as part of the PATH/Ekkubo study [2], which tested an intervention to increase the percentage of individuals diagnosed with HIV who promptly enroll in HIV care, start treatment, and achieve viral suppression, which indicates treatment success and reduces the likelihood of transmitting the virus to others. The study was implemented in the context of door-to-door home-based HIV testing. The present paper uses data from the baseline questionnaire, collected between November 2015 and December 2018 from all individuals who consented to be tested for HIV during home-based testing and to complete a questionnaire interview.
For this study, 27 socio-demographic, psychosocial, and health variables were selected as possibly relevant to HIV status as an outcome. The complete sample consisted of 23,937 participants from 26 villages in the Butambala, Mpigi, Gomba and Mityana districts of Central Uganda, of which 9,099 cases are completely observed. The remaining 14,838 cases lack information on alcohol consumption risk level and depressive symptoms score because these measures were added to the questionnaire in November 2017. Additionally, of the 14,838 incomplete cases, 2,612 are also missing occupation because this measure was changed during the course of the study, and 76 are also missing pregnancy status for unknown reasons. Alcohol consumption risk level and depressive symptoms score are integers ranging from 0 to 12 and from 0 to 30, respectively. Depressive symptoms were measured using the 10-item Center for the Epidemiological Studies of Depression Short Form (CESD-10) as defined in [7]. Alcohol consumption risk level was assessed using the Alcohol Use Disorders Identification Test-Concise (AUDIT-C), defined in [8]. Pregnancy and occupation are categorical variables with three and six unordered classes, respectively.
Descriptive statistics for numeric variables (mean, standard deviation and range), together with 95% confidence intervals for the difference of means between complete and incomplete cases, are presented in Table 1. A few variables showed relatively large differences of means: mean monthly income, individual-level anticipated stigma from family or health care workers due to HIV status, and years of education were higher for incomplete cases, and age was lower.
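As a rough illustration of how such an interval can be obtained from the summary statistics alone, the sketch below uses a normal approximation with unequal variances. The text does not state which method was used, so this choice is an assumption, and the function name `mean_diff_ci` is hypothetical. The input values are those reported for age in Table 1.

```python
import math

def mean_diff_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """Normal-approximation CI for mean1 - mean2, allowing unequal variances."""
    diff = mean1 - mean2
    # Standard error of the difference of two independent sample means
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return diff - z * se, diff + z * se

# Age centered at 14: complete cases (n = 9099) vs. incomplete cases (n = 14838)
low, high = mean_diff_ci(17.34, 11.11, 9099, 15.05, 10.07, 14838)
print(round(low, 2), round(high, 2))  # 2.01 2.57
```

Under these assumptions the result agrees with the interval reported for age in Table 1.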
The distributions of complete and incomplete cases for log of income, stigma from family and health care workers and years of education are shown in Figure 1. The plots suggest that the incomplete portion of the sample represents a more affluent sector of the population that might coincide with more years of education, and also that a portion of the incomplete cases anticipated a higher stigma from health care workers and family due to HIV status.
Table 1: Baseline mean and means difference for continuous variables, by completeness of cases.

| Variable | Complete (n = 9099): mean (sd) min; max | Incomplete (n = 14838): mean (sd) min; max | Total (n = 23937): mean (sd) min; max | Means difference: 95% CI |
| --- | --- | --- | --- | --- |
| village population (x 1000) | 2.86 (0.92) 0.55; 3.97 | 2.82 (1.15) 0.44; 4.42 | 2.84 (1.07) 0.44; 4.42 | (0.01; 0.07)* |
| age centered at 14 | 17.34 (11.11) 0; 45 | 15.05 (10.07) 0; 45 | 15.92 (10.5) 0; 45 | (2.01; 2.57)** |
| monthly income in dollars | 24.68 (41.71) 0; 833 | 38.77 (125.14) 0; 11111 | 33.42 (102) 0; 11111 | (-16.28; -11.9)** |
| time to health clinic (x 15 minutes) | 1.96 (1.99) 0.07; 28 | 1.65 (1.45) 0; 30 | 1.77 (1.68) 0; 30.33 | (0.27; 0.37)** |
| Anticipated stigma disclosure concern | 2.57 (1.06) 0; 4 | 2.57 (1.05) 0; 4 | 2.57 (1.05) 0; 4 | (-0.03; 0.03) |
| Anticipated HIV stigma from family | 1.08 (0.89) 0; 4 | 1.26 (0.99) 0; 4 | 1.19 (0.96) 0; 4 | (-0.21; -0.17)** |
| Anticipated HIV stigma from health care workers | 0.81 (0.64) 0; 4 | 0.91 (0.74) 0; 4 | 0.87 (0.70) 0; 4 | (-0.12; -0.08)** |
| Depressive symptoms | 5.83 (5.65) 0; 30 | - | 5.83 (5.65) 0; 30 | - |
| Alcohol consumption risk level score | 1.09 (2.2) 0; 12 | - | 1.09 (2.2) 0; 12 | - |
| Village HIV prevalence | 0.06 (0.02) 0.03; 0.1 | 0.04 (0.01) 0.02; 0.05 | 0.05 (0.02) 0.02; 0.1 | (0.02; 0.02)** |
| Enacted stigma village level | 0.98 (0.12) 0.8; 1.4 | 1.05 (0.16) 0.8; 1.4 | 1.03 (0.15) 0.8; 1.43 | (-0.07; -0.07)** |
| Anticipated stigma due to HIV disclosure concern village level | 2.57 (0.26) 2.2; 3.0 | 2.57 (0.49) 1.9; 3.3 | 2.57 (0.42) 1.9; 3.3 | (-0.01; 0.01)** |
| Anticipated stigma public attitudes village level | 1.65 (0.27) 1.4; 2.3 | 1.62 (0.21) 1.4; 2.0 | 1.64 (0.23) 1.4; 2.3 | (0.02; 0.04) |
| Anticipated HIV stigma from family village level | 1.07 (0.13) 0.9; 1.5 | 1.26 (0.12) 1.1; 1.7 | 1.19 (0.15) 0.9; 1.7 | (-0.19; -0.19)** |
| Anticipated HIV stigma from health care workers village level | 0.81 (0.09) 0.6; 0.9 | 0.92 (0.16) 0.7; 1.4 | 0.88 (0.15) 0.6; 1.4 | (-0.11; -0.11)** |
| Years of education | 6.55 (3.49) 0; 15 | 8.08 (3.35) 0; 15 | 7.5 (3.48) 0; 15 | (-1.61; -1.43)** |

Note. Significance for difference of means: * p-value <0.001, ** p-value <<0.00001, CI = Confidence Interval for mean difference, sd = standard deviation, min = minimum, max = maximum, bold shows values with larger differences.
Categorical variables are presented in Table 2, along with frequencies and percentages. P-values for the difference in frequency between complete and incomplete cases were all smaller than 0.001 and are not shown in the table. Some variables and categories with seemingly larger differences, such as wealth, occupation, HIV status and marital status, are emphasized in bold.
The distributions of the incomplete variables by HIV status are shown in Figures 2 and 3. For depressive symptoms score, the distribution is similar for negatives (-) and new positives (new +), and shows higher values and more spread for known positives (known +). For alcohol consumption risk level, new positives have a wider spread and higher median values than known positives, which in turn have a wider spread and higher values than negatives. For categorical variables, the figures are also grouped by complete and incomplete cases, where incomplete cases are those that have information for the categorical variable but not for the alcohol and depressive symptoms variables. Figure 3 shows a slight difference in the proportions of pregnancy status between complete (C) and incomplete (I) cases across the three classes of HIV status, and this difference is larger for new positives than for the other two HIV classes. For occupation, the distribution of occupation levels differs between complete and incomplete cases within each HIV class. For new positives, complete cases have a larger proportion in the fisherman and peasant farmer categories. Incomplete cases have a higher proportion of salaried tradespersons, while the proportion of people not employed outside the home seems similar across complete and incomplete cases for all HIV classes.
Table 2: Baseline frequency distribution for categorical variables by completeness of cases.

| Variable | Categories | Complete (n = 9099): n (column %) | Incomplete (n = 14838): n (column %) | Total (n = 23937): n (column %) |
| --- | --- | --- | --- | --- |
| HIV status | Negative | 8546 (93.9) | 14284 (96.3) | 22830 (95.4) |
| | New Positive | 140 (1.5) | 314 (2.1) | 454 (1.9) |
| | Known Positive | 413 (4.5) | 240 (1.6) | 653 (2.7) |
| gender | Male | 3342 (36.7) | 6707 (45.2) | 10049 (42.0) |
| | Female | 5757 (63.3) | 8131 (54.8) | 13888 (58.0) |
| religion | Muslim | 2184 (24.0) | 3870 (26.1) | 6054 (25.3) |
| | Christian & non-Muslim | 6915 (76.0) | 10968 (73.9) | 17883 (74.7) |
| Wealth index | lowest quintile | 3356 (36.9) | 1517 (10.2) | 4873 (20.4) |
| | 2nd lowest quintile | 2311 (25.4) | 2971 (20.0) | 5282 (22.1) |
| | 3rd lowest quintile | 1590 (17.5) | 2867 (19.3) | 4457 (18.6) |
| | 4th lowest quintile | 984 (10.8) | 2850 (19.2) | 3834 (16.0) |
| | highest quintile | 858 (9.4) | 4633 (31.2) | 5491 (22.9) |
| away for work 1 month or more | not away | 8039 (88.4) | 12677 (85.4) | 20716 (86.5) |
| | one or more times | 1060 (11.6) | 2161 (14.6) | 3221 (13.5) |
| transportation | free: walking/bike | 4882 (53.7) | 7370 (49.7) | 12252 (51.2) |
| | low cost: taxi | 958 (10.5) | 2870 (19.3) | 3828 (16.0) |
| | high: boda/car | 3259 (35.8) | 4598 (31.0) | 7857 (32.8) |
| marital status | Never married | 1618 (17.8) | 4498 (30.3) | 6116 (25.6) |
| | married | 1556 (17.1) | 2204 (14.9) | 3760 (15.7) |
| | widowed/divorced | 5925 (65.1) | 8136 (54.8) | 14061 (58.7) |
| another household member is HIV + | No | 7260 (79.8) | 12067 (81.3) | 19327 (80.7) |
| | Yes | 390 (4.3) | 442 (3.0) | 832 (3.5) |
| | Do not know | 1449 (15.9) | 2329 (15.7) | 3778 (15.8) |
| | | n = 9099 | n = 14762 | n = 23861 |
| pregnancy of self or partner | No | 7272 (79.9) | 10918 (74.0) | 18190 (76.2) |
| | Yes | 906 (10.0) | 1236 (8.4) | 2142 (9.0) |
| | No partner | 921 (10.1) | 2608 (17.7) | 3529 (14.8) |
| | | n = 9099 | n = 12226 | n = 21325 |
| occupation | Peasant farmer | 3998 (43.9) | 3046 (24.9) | 7044 (33.0) |
| | Casual worker | 839 (9.2) | 1182 (9.7) | 2021 (9.5) |
| | Salaried trade | 750 (8.2) | 2988 (24.4) | 3738 (17.5) |
| | fish/csw/rest/bar/attendant | 430 (4.7) | 83 (0.7) | 513 (2.4) |
| | business selling | 1287 (14.1) | 1799 (14.7) | 3086 (14.5) |
| | not employed out home | 1795 (19.7) | 3128 (25.6) | 4923 (23.1) |

Note. n = sample size, bold shows values with larger differences.
As mentioned before, if CCA is used for this dataset, more than 60% of the cases will be discarded whenever the alcohol consumption or depressive symptoms variables are included in the analysis model. Bias may be introduced by systematic differences between complete and incomplete cases. Furthermore, critical information is potentially lost when attempting to predict new positive HIV cases with any classification method because of the small number of complete cases in this category. Random forest (RF) models [9] were used to predict HIV status because they were found to perform well in preliminary analysis.
Missing Data Imputation
As in our linkage to HIV care dataset, missing values are common in socio-demographic, environmental, medical, epidemiologic, political and biological sciences research. Therefore, when missing data are present in a dataset, it is important to determine how to proceed. If missing data are random and sparse and do not decrease the sample size significantly, CCA may be efficient. On the other hand, when there are nonrandom missing patterns or a high percentage of incomplete cases, CCA may produce biased results and loss of statistical power [10,11]. When adequate, CCA is advantageous for its simplicity and ease of implementation. There are also some specific cases in which CCA is appropriate even when missing data are not random. For example, coefficient estimates may be unbiased when linear or logistic regression is the analysis model and missingness does not depend on the outcome once the covariates included in the model are taken into account. Hughes et al. [12] present a rich discussion of when CCA leads to bias or loss of precision.
When the impact of missing data is not negligible, missing data imputation may be a suitable alternative. Single imputation (SI) and multiple imputation (MI) techniques use information from observed values to fill in values for the unobserved ones.
Single imputation consists of replacing each missing value in the dataset with a plausible one, after which CCA is conducted on the newly completed dataset [13]. The imputed value may be, for example, the target variable mean, range or some type of regression prediction. The main problem with this approach is that the uncertainty due to guessing the unobserved values is not accounted for, which leads to two issues: variances of regression estimates tend to be underestimated, generating confidence intervals with low coverage, and coefficient estimates may be biased, as van Buuren explains in [14]. More efficient SI techniques such as Maximum Likelihood (ML) and Expectation Maximization (EM) algorithms have been developed and generally result in less bias, but according to Clavel et al. [10] the reliability of results is hard to assess, depending on the patterns and percentage of missing data. The missForest package in R [4] is the SI method used for this study.
Multiple imputation consists of creating several complete datasets. Each dataset contains the same observed values, together with imputed values for the missing ones. The imputed values may be distinct in each complete dataset, so that the datasets generated are potentially different [14]. This difference inflates the variance of model parameters and therefore takes into account the uncertainty of the imputations [1,13,14]. A CCA analysis is carried out on each complete dataset and the results are combined into one. The rule presented by Rubin [15] is applied for averaging analysis model parameters and combining their variances. Most MI methods generate five complete datasets by default, and according to van Buuren [14] and Clavel et al. [10], this is usually sufficient to get a valid estimate of the imputation variance. For this study, the MI approach is evaluated with the R packages amelia [3,16,17], mice [5,14] and hmisc [6].
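The combining step can be illustrated with a minimal sketch of Rubin's rules for a single scalar parameter. The function name `pool_rubin` and the numbers in the example are hypothetical and are not results from this study. The pooled estimate is the average of the per-dataset estimates, and the total variance is the average within-imputation variance plus the between-imputation variance inflated by a factor of (1 + 1/m).

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m estimates of one parameter and their variances with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)        # average of the m point estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical coefficient estimates from m = 5 imputed datasets
estimates = [0.50, 0.55, 0.45, 0.52, 0.48]
variances = [0.010, 0.010, 0.010, 0.010, 0.010]
pooled, total = pool_rubin(estimates, variances)
print(pooled, total)
```

The between-imputation term is what carries the uncertainty of the imputed values into the final variance. With a single imputed dataset this term cannot be estimated.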
Different imputation methods included in this study use different imputation models for imputation of missing data. An imputation model has a target variable as its outcome and all the other variables as predictors or auxiliary. Predictions from the imputation model, which is fit using the observed values, are used to impute into the missing values. More information in the imputation model will create better imputations because imputation is predictive [17], therefore, the imputation model should include at least as much information as the analysis model of interest, including transformations and interactions of variables. In general, all variables should be considered as auxiliary, even when there is not a difference in distribution between complete and incomplete data, and also include the other incomplete variables, which can have their own imputed values in the missing ones.
The two main MI approaches are joint modeling (JM) and fully conditional specification (FCS). In JM, it is assumed that the data follow a specified joint multivariate distribution, at least approximately. The algorithm uses observed data to estimate the parameters of the joint distribution, randomly samples from it and then imputes missing values to create complete datasets. The amelia package follows the JM approach. With the FCS approach, each target variable has its own imputation model, conditional on auxiliary variables from the dataset, which may or may not be predictors in the analysis model. The mice and hmisc packages follow an FCS approach.
In general, when doing multiple imputation, the missing data may depend on observed values but cannot depend on any unobserved data, and this missing data mechanism is called missing at random (MAR). The linkage to HIV care dataset contains data with MAR characteristics; incomplete cases come from specific villages and from a set period of the study that collected the data, which are both known information, along with several observed variables that contain information that may support and inform the imputation process.
Imputation Algorithms and Models
The amelia package was first presented by King et al. [17]; it was later discussed for time-series cross-section data and its algorithm updated by Honaker & King [3,16]. Amelia assumes a MAR missing data mechanism and a joint multivariate normal distribution for the data. For binary and ordered categorical variables, continuous values are imputed and then rounded to the nearest category. Unordered categorical variables are broken into binary indicators for each category; the imputed values are then treated as parameters of a multinomial distribution, and categories are imputed according to these probabilities [1]. Amelia runs an EM algorithm to estimate the mean and variance of the joint normal distribution. This is done once for each imputed dataset, each time on a different bootstrap sample of the data. Bootstrapping is done so that each complete dataset generated has potentially different imputed values.
The multivariate imputation by chained equations (mice) package in R has been regularly updated with functionality such as MI for data that are missing not at random (NMAR), imputation with tree-based models and imputation of multi-level data. In mice, an imputation model is defined to estimate the conditional distribution of each type of target variable. Several options are available, depending on the apparent distribution and prior knowledge of each variable. The parameters of the imputation model are estimated via a pseudo Gibbs sampler algorithm. The imputation models in mice investigated in this study are linear and multinomial regression for numeric and categorical variables, respectively, and classification and regression trees for both variable types.
The Harrell Miscellaneous (hmisc) package in R, presented by Alzola & Harrell [6], has numerous functionalities for data analysis, advanced graphics and tables, sample size and power computation, and missing data imputation, among several others. The aregImpute function from hmisc conducts MI and follows the same FCS approach as mice, imputing one variable at a time and estimating its distribution conditional on the other variables. The main difference from mice is in its imputation model, which is based on additive semi-parametric regression, where the models are defined based on transformations of the target variable and the predictors. The function aregImpute uses alternating conditional expectations (ACE) or additive and variance stabilization for regression (AVAS) functions to find which transformations to apply. Details for these functions were presented by Breiman [18], Tibshirani [19] and Banks [20]. Yadav & Roychoudhury [21] assessed imputation time for datasets of different sizes and found hmisc to be much faster than missForest and mice, so it is a good alternative if computation time is a concern.
Both mice and hmisc use predictive mean matching in their algorithms. Predictive mean matching works with the imputation model to impute values that are similar to an observed value. First, the predicted values for all cases in the target variable are calculated via the imputation model; then a donor is randomly selected from a small group of complete cases. The donor is an observed case whose predicted value is similar to the predicted value for the incomplete case. This works well under the assumption that missing values and complete cases with close predicted values follow similar distributions.
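The matching idea can be sketched in a few lines. The example below is a simplified illustration and not the mice or hmisc implementation: it assumes a single numeric predictor, a least-squares line as the imputation model, and a donor pool of the k observed cases with the closest predicted values. The real packages also add parameter or bootstrap uncertainty before matching. All function names are hypothetical.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def pmm_impute(x, y, k=3, rng=None):
    """Fill None entries of y with observed values borrowed from close donors."""
    rng = rng or random.Random(0)
    observed = [i for i, value in enumerate(y) if value is not None]
    intercept, slope = fit_line([x[i] for i in observed], [y[i] for i in observed])
    predicted = [intercept + slope * xi for xi in x]
    completed = list(y)
    for i, value in enumerate(y):
        if value is None:
            # Donor pool: the k observed cases whose predictions are closest
            donors = sorted(observed, key=lambda j: abs(predicted[j] - predicted[i]))[:k]
            completed[i] = y[rng.choice(donors)]  # copy the donor's observed value
    return completed

x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 2, 4, None, 8, 10, None, 14, 16, 18]
print(pmm_impute(x, y))
```

Because every imputed value is copied from an observed case, imputations always fall within the range of values actually seen in the data, which is convenient for bounded scores such as those used here.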
The missForest package in R follows an approach similar to FCS: it conducts imputation one target variable at a time instead of assuming a joint distribution for the data, but it imputes only once. It is a nonparametric SI algorithm that generates a single imputed dataset based on predicted values from an RF model fit for each target variable. The algorithm was introduced in detail by Stekhoven & Bühlmann [4], who reported promising results: it produced more accurate imputations than parametric mice, especially when dealing with categorical and numeric variables at once, and it was found to be particularly useful in the presence of complex interactions and non-linear relationships amongst variables. MissForest was found to perform similarly to mice by Penone et al. [22] and Waljee et al. [23], with a life-history traits database and a missing laboratory data application, respectively. Although Shah et al. [24] found that imputation with missForest resulted in biased regression parameters in the analysis model when compared to mice, the analysis model used for our linkage to HIV care dataset is an RF, and no applied study with a similar setup was found. The linkage to HIV care dataset may contain complex relationships amongst variables that are difficult to explain with a parametric model, and it has both numeric and categorical incomplete variables. These are situations in which missForest is designed to perform well.
Simulations
Two sets of simulations were considered. In the first, samples were taken from the complete portion of the dataset, missing data were simulated, and imputations were made using only the information in each selected sample. In the second, samples were taken from the full dataset, with missing data as observed in the linkage to HIV care dataset, and imputations were made using information from the full dataset rather than just the selected sample, taking advantage of a larger amount of data to inform the imputation process.
For the first set of simulations, nine scenarios were set up for each imputation method, using samples from the complete portion of the linkage to HIV care dataset. Sample sizes were 350, 700 and 1,050. Missing values were introduced in 20%, 40% and 60% of observations in the four target variables: alcohol consumption, depression symptoms, pregnancy and occupation. To simulate missing data, a randomly selected set of cases was set to have missing values in all four variables at once. Different configurations of tuning parameters were explored in preliminary simulations for each imputation method, and the best performing ones were selected. For MI methods, five multiply imputed datasets were generated each time. One thousand simulation runs were performed for each scenario.
To illustrate one simulation with an MI method, consider one sample of 350 cases taken from the original 9,099 complete cases via stratified random sampling, so that the sample is balanced for the outcome HIV status. Then, 20% of the cases are randomly selected to have missing values in the four target variables. Next, the MI algorithm generates five complete (imputed) datasets, using the information contained in the 27 variables for the 350 cases. Then, five estimates of imputation and modeling precision are calculated and averaged; the measures used to evaluate precision are explained later in this section. Similar steps are repeated for 1,000 samples. The same is done for 40% and 60% missing values, and for samples with 700 and 1,050 cases.
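The sampling and missing-data steps of one such run can be sketched as follows. This is an illustration under stated assumptions and not the code used in the study: the column names, the toy records and the helper functions `stratified_sample` and `add_missing` are hypothetical, and the imputation step itself is omitted.

```python
import random

# Hypothetical column names for the four target variables
TARGETS = ["alcohol", "depression", "pregnancy", "occupation"]

def stratified_sample(rows, outcome, per_class, rng):
    """Draw the same number of cases from every class of the outcome."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[outcome], []).append(row)
    sample = []
    for members in by_class.values():
        sample.extend(rng.sample(members, per_class))
    return sample

def add_missing(rows, targets, fraction, rng):
    """Set all target variables to None for a random fraction of the cases."""
    n_missing = round(fraction * len(rows))
    chosen = set(rng.sample(range(len(rows)), n_missing))
    return [
        {key: (None if i in chosen and key in targets else value)
         for key, value in row.items()}
        for i, row in enumerate(rows)
    ]

rng = random.Random(1)
# Toy complete cases: 50 per HIV class, with made-up values
complete = [
    {"hiv": status, "age": 20, "alcohol": 1, "depression": 5,
     "pregnancy": "No", "occupation": "Casual worker"}
    for status in ("negative", "new positive", "known positive")
    for _ in range(50)
]
sample = stratified_sample(complete, "hiv", per_class=10, rng=rng)
incomplete = add_missing(sample, TARGETS, fraction=0.2, rng=rng)
print(sum(row["alcohol"] is None for row in incomplete), "of", len(incomplete))
```

Setting all four variables to missing for the same cases mirrors the design described above, in which the selected cases lose all target variables at once.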
Because mice offers several options, three different configurations were tested and are referred to as mice, mice.rf and mice.cart. For mice, parameters were kept at their default values, so categorical variables were imputed using a polytomous logistic regression model and numeric variables using a linear regression model, both combined with predictive mean matching. When using mice.rf, a random forest was created to predict and impute each target variable. For mice.cart, one optimal tree was created to impute the data for each of the five complete generated datasets.
For imputation with hmisc, transformations of the numeric variables were set to be linear. The predictive mean matching type was set so that predicted values for both incomplete and complete observations of the target come from a fit to the same bootstrap sample. For some samples, after simulating missing data, there were not enough values in all classes of all categorical variables for the hmisc algorithm to fit its imputation model. Because of that, the total number of simulations with this method was slightly smaller than 1,000 in the following scenarios: simulations with 350 cases and 40% or 60% missing values ran successfully 995 and 861 times, respectively, and simulations with 700 cases and 60% missing values ran successfully 998 times.
For imputations with missForest, a single imputation method, the results of each simulation are based on one single forest with ten trees generated for each target variable. Other parameters were kept with their default values. Forests with up to 100 trees were evaluated, but differences in imputation and prediction accuracy were minimal and are not reported. Therefore, the number of trees for the forests in missForest and mice.rf imputation was set to ten, which allowed for similar precision to larger forests but improved imputation speed.
In this first set of simulations, mice.rf and mice.cart did not offer consistent advantages over mice with default configurations and were more time consuming, so these configurations were not included in the simulations with the full dataset. This first set of simulations was designed with the goal of investigating if any of the imputation methods performed significantly better or were more time efficient for selected sample sizes, missing percentages or performance measures, given the dataset of interest with its intrinsic characteristics of mixed numeric and categorical variables, collinearity, possible nonlinear relations and interaction between the variables.
Based on the results of the initial imputation simulations, a second set of simulations was designed with the main goal of assessing the performance of imputation on the incomplete HIV dataset to be analyzed. In this step, two scenarios were considered after imputing into the full linkage to HIV care dataset: samples with 350 and 1,200 observations. These sizes were chosen because there is a limited number of units in the positive classes of the HIV status variable. To select samples with no class imbalance, without oversampling or synthetic samples (which were found to be not very effective for this dataset), and with different new positive cases in different samples, sample sizes were set slightly below three times the number of observations in the minority class. Missing values were as observed for each target variable: about 60% for alcohol consumption and depression symptoms, 11% for occupation and 0.3% for pregnancy. Here, amelia, mice, hmisc and missForest were used to impute into the full linkage to HIV care dataset. The selected samples included HIV negative cases from the complete portion of the dataset only, and new and known HIV positive cases from both the complete and imputed data. Imputation increases the cases available for modeling in the HIV positive classes, from 140 to 454 new positives and from 413 to 653 known positives. Imputation also adds some variability due to the uncertainty of imputed values. Therefore, we chose to limit this variability to the inclusion of the additional positive cases, which are needed to increase the sample size for modeling. There was no need to include the 14,284 originally incomplete cases in the HIV negative class, under the assumption that negatives are well represented by the 8,546 originally complete cases.
The setting that included imputed observations from all HIV classes was also tested and gave worse results overall: lower prediction accuracy and no improvement in sensitivity, with much larger variability for predicting HIV negatives and no advantage for predicting the HIV positive classes.
The performance measures of interest are described in the following paragraphs. For the first set of simulations, because the real observed values are known, the precision of imputed values can be evaluated via misclassification rate, normalized root mean squared error (NRMSE), and mean rank. The calculations for obtaining these measures are described below:
- categorical target variables - pregnancy and occupation: misclassification rate is calculated by comparing the imputed value to the original value and obtaining the wrongly classified percentage,
- continuous target variables - alcohol consumption risk level and depressive symptoms score: NRMSE, which is obtained by calculating the mean squared distance between observed and imputed values, then taking the square root and dividing by the standard deviation. With $V$ representing the variance, $\hat{y}_i$ the ith imputed value, $y_i$ the ith observed value and $n$ the total number of cases, then

  $$\mathrm{NRMSE} = \sqrt{\frac{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{V(y)}}$$
- all target variables together: mean rank for the methods, which is a proposed measure used to summarize how each method performs compared to the others. It is based on the NRMSE and misclassification rate mean values considering all simulations at once, and is obtained for one simulation scenario by the following steps:
  1. Obtain mean NRMSE for alcohol consumption and depression symptoms, and mean misclassification rate for pregnancy and occupation, separately by imputation method.
  2. Sort the methods by mean NRMSE for alcohol consumption. Rank methods from smallest to largest mean NRMSE, obtaining numbers 1 through 6 (there are six method configurations), and call this column Rank 1.
  3. Repeat step 2 for mean NRMSE for depressive symptoms, and call this Rank 2.
  4. Repeat step 2 for mean misclassification rate for pregnancy and occupation, to obtain Ranks 3 and 4.
  5. Take the average of the four ranks to obtain the mean rank for each method.
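The three measures above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the sample variance of the observed values is assumed as the normalizing variance, ties in the ranking are not handled, and the scores in the example are made up and are not results from this study.

```python
from statistics import mean, variance

def misclassification_rate(observed, imputed):
    """Share of imputed categories that differ from the original category."""
    return sum(o != i for o, i in zip(observed, imputed)) / len(observed)

def nrmse(observed, imputed):
    """Root mean squared error divided by the standard deviation of observed values."""
    mse = mean((y_hat - y) ** 2 for y, y_hat in zip(observed, imputed))
    return (mse / variance(observed)) ** 0.5

def mean_rank(scores):
    """scores maps method -> list of mean measures (lower is better)."""
    methods = list(scores)
    n_measures = len(scores[methods[0]])
    ranks = {method: [] for method in methods}
    for j in range(n_measures):
        # Rank 1 goes to the method with the smallest value on measure j
        ordered = sorted(methods, key=lambda method: scores[method][j])
        for rank, method in enumerate(ordered, start=1):
            ranks[method].append(rank)
    return {method: mean(r) for method, r in ranks.items()}

# Made-up mean scores: [NRMSE alcohol, NRMSE depression,
#                       misclassification pregnancy, misclassification occupation]
scores = {
    "method_a": [0.9, 1.0, 0.20, 0.60],
    "method_b": [1.1, 0.8, 0.25, 0.55],
    "method_c": [1.0, 1.2, 0.15, 0.70],
}
print(mean_rank(scores))
```

A lower mean rank indicates a method that tends to sit near the top across all four target variables, even if it is not the best on any single one.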
Although measurements based on MSE and misclassification rate can quantify the ability of each method to impute values similar to the observed ones, such measures cannot answer practical analytical questions, such as whether the quality of the analysis models improved after imputing data compared with conducting CCA. Thus, we also evaluate the performance of the analysis model before and after imputation, and this is done for both sets of simulations.
Several measures for assessing an RF model's performance can be obtained from the confusion matrix, such as accuracy, sensitivity, specificity and the Receiver Operating Characteristic (ROC) curve. For our analysis model, prediction accuracy and sensitivity were the measures extracted before and after imputation, using the function confusionMatrix from the caret package in R [25]. Accuracy is the percentage of all cases in the test sample whose HIV status is correctly predicted by the analysis model. Sensitivity for a specific class is the ratio of the number of cases in that class that are correctly predicted to the total number of cases observed in the class. It varies from 0 to 1, with 1 being perfect prediction, and it provides a way of evaluating performance for each HIV status separately. Accuracy and sensitivity were calculated using the RF fit to predict a test sample of 500 observations, with 2% in each positive HIV status. This percentage was selected to estimate how well the models might perform if applied to a population with percentages in each HIV status approximately similar to those observed in the full linkage to HIV care dataset: 1.9% and 2.7% for new and known HIV positive, respectively.
When modeling with the linkage to HIV care dataset, it was important to take into account that class imbalance is present in the outcome variable HIV status and that, because of this, caution is needed when evaluating prediction accuracy for the analysis models. Leevy et al. (2018) argued that in such cases it is challenging to identify the minority class because a high class imbalance introduces a bias in favor of the majority class, and a classifier that predicts all cases in the majority class may produce a deceptively high accuracy rate. Thus, to avoid models that may attempt to maximize prediction accuracy by classifying all observations in the negative HIV class, samples for training the RF models are balanced for the HIV status variable, with approximately a third in each class. When there are not enough observations in the new HIV positive class, synthetic cases are created via SMOTE (Synthetic Minority Over-sampling Technique). SMOTE is performed with the SmoteClassif function from the UBL package in R. This is an attempt to increase the decision region for the minority class. This technique usually helps to increase the proportion of true positives predicted by the classification model, with an increased proportion of false positives being the trade-off [26].
In preliminary simulations, the sensitivity for the new HIV positive class got increasingly worse with larger sample sizes, even when class imbalance was removed through oversampling or SMOTE. Therefore, taking into consideration the limitations of modeling with a small number of cases in the minority class, the sample sizes for simulations were kept at about three times the number of available cases in the new HIV positive class.
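As a small illustration of these two measures, the sketch below computes accuracy and per-class sensitivity directly from actual and predicted labels. It is not the caret implementation; the function names and the toy labels are hypothetical.

```python
def accuracy(actual, predicted):
    """Share of all cases whose class is predicted correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def sensitivity(actual, predicted, cls):
    """Share of the cases observed in class `cls` that are predicted as `cls`."""
    predictions_in_class = [p for a, p in zip(actual, predicted) if a == cls]
    return sum(p == cls for p in predictions_in_class) / len(predictions_in_class)

# Toy test sample with three HIV status classes
actual = ["neg"] * 6 + ["new+"] * 2 + ["known+"] * 2
predicted = ["neg"] * 5 + ["new+"] + ["new+", "neg"] + ["known+", "known+"]

print(accuracy(actual, predicted))             # 0.8
print(sensitivity(actual, predicted, "new+"))  # 0.5
```

In this toy sample the overall accuracy is high while only half of the new positives are identified, which is why sensitivity is examined for each HIV status separately.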
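The core idea of SMOTE can be sketched for numeric features only. This is a simplified illustration and not the UBL SmoteClassif implementation, which also handles categorical variables; the function name `smote_numeric` and the toy points are hypothetical. Each synthetic case is placed at a random position on the segment between a minority case and one of its k nearest minority neighbors.

```python
import random

def smote_numeric(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base case (squared Euclidean distance)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, between 0 and 1
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

# Toy minority class with two numeric features
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_cases = smote_numeric(minority, n_new=5)
print(new_cases)
```

Because synthetic cases lie between existing minority cases, they widen the region the classifier associates with the minority class without simply duplicating observations.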