The objective of this report was to illustrate the application of a machine learning algorithm to impute missing daily alcohol consumption data in a prospective study of pregnant women. When pregnant women were asked about recent alcohol consumption during their prenatal visits, days with missing data were an inherent consequence of the assessment methodology, and missing data were more common among recent drinkers. Thus, data were not missing at random. We implemented an extension of the k-NN algorithm that accounts for the absence of a ‘typical/classic’ reference group, i.e., training data with no missing days. To our knowledge, the present report is the first to describe this method for imputing missing alcohol consumption data in a longitudinal study of pregnant women. Validation of our approach showed high agreement between actual and predicted drinking values.
There is a paucity of studies addressing the potential bias introduced by missing data, as well as a lack of methodological tools to test the validity of such studies, in alcohol and drug use research 16. Published work has not yet reported the performance of any machine learning method for imputing missing alcohol data. In a simulated dataset, Hallgren et al. compared imputation methods including complete case analysis, last observation carried forward, worst-case scenarios (missing equals any drinking, or heavy drinking), multiple imputation (MI), and full information maximum likelihood (FIML), and concluded that MI and FIML yielded the least biased estimates 17,18. A recent study by Grittner et al., based on a longitudinal study in Denmark with five alcohol measurements over a period of five years, also found that MI produced the least bias 19. However, all methods in that study, including MI, underestimated the actual drinking level. In addition, MI models were originally recommended for imputing a single value per subject 20. Imputing irregularly spaced missing longitudinal data, as in PASS, would require complex extensions of MI 21, and applying MI to such a large dataset is computationally intensive. Despite the most recent advances in the field (see the Single Center Imputation from Multiple Chained Equation (SICE) approach)22, applying such methods to impute daily-level drinking (and other substance use) data appears impractical at present.
There are several advantages to using a non-parametric algorithm such as k-NN for imputing missing data. Most standard software packages rely on the assumption that multivariate data are normally distributed, so imputing repeated longitudinal data is challenging in most software options 21. In the PASS dataset, alcohol data were collected at the daily level, resulting in a high total volume of both data per participant and associated missing data. In the general population, alcohol consumption in pregnancy is highly skewed, with the majority of drinking concentrated in the first trimester. We observed this pattern in PASS; however, many study subjects also showed a gradually decreasing drinking pattern. In such scenarios, a non-parametric method such as k-NN has the advantage of making no distributional assumption.
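As a toy illustration of why normal-theory imputation fits such data poorly, the sketch below (entirely hypothetical values, not PASS data) applies mean imputation to a zero-inflated daily drinking series: every imputed day receives one small positive value, so the dominant zero mode of the true distribution is erased.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical zero-inflated daily drinks: roughly 90% abstinent days,
# so the distribution is highly skewed rather than normal.
truth = np.where(rng.random(10_000) < 0.1,
                 rng.integers(1, 6, 10_000), 0).astype(float)

miss = rng.random(truth.size) < 0.2        # hold out 20% of days
observed = truth.copy()
observed[miss] = np.nan

# Mean imputation (a stand-in for normal-theory methods) fills every
# missing day with the same small positive value, erasing the zero mode.
mean_fill = np.nanmean(observed)
print(f"imputed value for every missing day: {mean_fill:.2f}")
print(f"true share of zero days among held-out: {np.mean(truth[miss] == 0):.2f}")
```

A neighbor-based method that copies values from subjects with similar observed patterns avoids this distortion, since donated values come from the same skewed distribution.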
The sample size required to achieve reliable performance of k-NN imputation depends on the variability of the data being imputed: the higher the variability in the sample, the more observations are needed to draw inferences from the data. The choice of the number of neighbors (K) depends on the nature of the problem under investigation, the available data, and the goals of downstream analyses. On average, a higher number of neighbors yields greater prediction accuracy, but at the cost of significantly inflated standard deviations23. In most scenarios, a smaller K is a good compromise between performance and preservation of the original distribution of the data, although an overly restricted number of neighbors may fail to identify the observations most similar to the one under consideration. It is therefore advisable to derive goodness-of-fit measures to inform the choice of the optimal number of neighbors. The computational load of neighbor searching and of storing the training set must also be taken into consideration24. While it is not possible to indicate the optimal number of neighbors for a given dataset a priori without conducting a sensitivity analysis, in the context of our work we found K=5 to be a reasonable trade-off between the RMSE for drinking and non-drinking segments. Similar values of K were also reported in prior studies25,26.
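The kind of sensitivity analysis described above can be sketched as follows. This is a minimal hypothetical illustration, not the study's implementation: it uses synthetic zero-inflated data, a plain squared-distance on shared observed days, and donor averaging. Known values are artificially held out, re-imputed for several candidate K, and the hold-out RMSE for each K is inspected.

```python
import numpy as np

def knn_impute(observed, k):
    """Impute each subject's missing days as the mean of the k subjects
    whose observed drinking patterns are closest (squared distance on
    the days both subjects have observed)."""
    out, n = observed.copy(), observed.shape[0]
    for i in range(n):
        miss = np.isnan(observed[i])
        if not miss.any():
            continue
        dists = np.full(n, np.inf)
        for j in range(n):
            both = ~np.isnan(observed[i]) & ~np.isnan(observed[j])
            if j != i and both.sum() >= 5:
                dists[j] = np.mean((observed[i, both] - observed[j, both]) ** 2)
        pool = observed[np.argsort(dists)[:k]]        # k nearest donors
        cnt = np.sum(~np.isnan(pool), axis=0)
        fill = np.where(cnt > 0, np.nansum(pool, axis=0) / np.maximum(cnt, 1), 0.0)
        out[i, miss] = fill[miss]
    return out

# Synthetic zero-inflated daily data with an artificial 20% hold-out.
rng = np.random.default_rng(1)
truth = np.where(rng.random((150, 30)) < 0.1,
                 rng.integers(1, 6, (150, 30)).astype(float), 0.0)
mask = rng.random(truth.shape) < 0.2
observed = truth.copy()
observed[mask] = np.nan

# Hold-out RMSE for each candidate K; the minimizing K guides the choice,
# weighed against distribution preservation and computational cost.
for k in (1, 3, 5, 10, 25):
    imputed = knn_impute(observed, k)
    rmse = np.sqrt(np.mean((imputed[mask] - truth[mask]) ** 2))
    print(f"K={k:2d}  RMSE={rmse:.3f}")
```

In practice the RMSE curve typically flattens as K grows, while larger K smooths imputed values toward the mean; inspecting both the error and the imputed distribution, as described above, guides the final choice.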
Recent studies demonstrated that RMSE can lead to incorrect inferences when used to evaluate the distributional accuracy of imputation methods27,28. In our work, RMSE was used solely to identify the optimal number of neighbors. To evaluate imputation performance, we derived metrics widely accepted in the fields of machine learning and artificial intelligence. As an example, the confusion matrix in Figure 5 shows the imputation accuracy for the binary drinker/non-drinker classification, along with the mean absolute difference between predicted and actual daily drinking values and its confidence interval. Our choice of imputation metric was dictated by the type of data to be imputed and our downstream analysis goal. While most prior studies reported imputation of alcohol data as a binary classification (yes/no), we imputed daily drinking data (of 3.2 million person-days, data were missing for 0.36 million). It would therefore be impractical, and rather misleading, to validate the imputation by estimating the association between daily-level data and a known outcome (e.g., birth weight). The imputed data we derived are suitable for further processing, such as cluster analyses29, for which the evaluation metrics we used are adequate.
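The evaluation metrics mentioned above can be computed as in the sketch below. The actual and imputed arrays here are simulated stand-ins (not the study's validation data), and the bootstrap confidence interval is one common way to obtain the interval around the mean absolute difference; the study's exact procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical actual vs. imputed daily drink counts on validation days:
# imputation is perturbed by +/- 1 drink on about 15% of days.
actual = np.where(rng.random(5000) < 0.1, rng.integers(1, 6, 5000), 0).astype(float)
noise = rng.integers(-1, 2, 5000) * (rng.random(5000) < 0.15)
imputed = np.clip(actual + noise, 0, None)

# 2x2 confusion matrix for the binary drinker / non-drinker classification.
a, p = actual > 0, imputed > 0
conf = np.array([[np.sum(~a & ~p), np.sum(~a & p)],
                 [np.sum(a & ~p),  np.sum(a & p)]])
accuracy = np.trace(conf) / conf.sum()

# Mean absolute difference with a 95% percentile bootstrap interval.
abs_diff = np.abs(imputed - actual)
boot = [np.mean(rng.choice(abs_diff, abs_diff.size)) for _ in range(1000)]
lo, hi = np.percentile(boot, [2.5, 97.5])

print("confusion matrix:\n", conf)
print(f"accuracy={accuracy:.3f}  MAD={abs_diff.mean():.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

The confusion matrix summarizes binary classification quality, while the mean absolute difference captures error on the continuous daily drink counts; together they address both facets of the imputed data.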
The k-NN algorithm is increasingly used to impute missing data in research involving high-volume data, such as genetics and metabolomics studies 30,31. In several recent reports, k-NN produced the smallest imputation error compared with methods such as mean and median imputation, Bayesian linear regression, and the K-Means and K-Medoids clustering algorithms 32,33. However, some studies reported that simpler methods such as mean or median replacement were as adequate as k-NN when imputation was followed by clustering of genetic data 34. On the other hand, some have reported slightly better performance of random forest over k-NN for imputing metabolomics data 35. Another study noted improved performance of k-NN when additional information, such as SES and demographic data, was included in the prediction model 36. We used cosine distance to measure the similarity between the drinking patterns of two subjects. Chomboon et al. evaluated 11 distance measures and showed that several other distance measures also perform adequately37; future studies could evaluate the performance of multiple distance measures in imputing alcohol data. The validity and accuracy of imputation will likely vary with the data type, data structure, mechanism of missingness, amount of missing data, and choice of downstream analyses. Future studies are therefore needed to evaluate the performance of different machine learning algorithms for imputing alcohol consumption data.
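To illustrate the distance choice discussed above, the hypothetical sketch below contrasts cosine distance with Euclidean distance on two toy drinking vectors: cosine distance depends only on the *pattern* of drinking days, not on the per-occasion volume, whereas Euclidean distance penalizes the volume gap.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity between two daily drinking vectors."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0 or nv == 0:
        return 1.0   # convention: treat an all-zero pattern as maximally dissimilar
    return 1.0 - float(u @ v) / (nu * nv)

# Two hypothetical subjects who drink on the same days but at different
# volumes (one vs. three drinks per occasion), and one opposite pattern.
light = np.array([1., 0., 0., 1., 0., 1., 0.])
heavy = 3.0 * light
other = np.array([0., 1., 1., 0., 1., 0., 1.])

print(cosine_distance(light, heavy))   # 0.0: identical pattern, any scale
print(cosine_distance(light, other))   # 1.0: no shared drinking days
print(np.linalg.norm(light - heavy))   # Euclidean penalizes the volume gap
```

This scale-invariance is why a cosine-based neighbor search groups subjects by when they drink rather than by how much, which may or may not be desirable depending on the downstream analysis; evaluating alternative measures, as noted above, is a natural extension.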
In this paper, we provide a comprehensive description of imputing, with high accuracy, prenatal alcohol data collected using the timeline follow-back method with k-NN. Data collection methods such as the timeline follow-back method38 and food frequency questionnaires39, which collect extensive consumption data, are prone to informative missingness. The methodological details presented in this paper are highly relevant to various research areas, including substance use research, that suffer from missing data in longitudinal studies.