The objective of this report was to illustrate the application of a machine learning algorithm to impute missing daily alcohol consumption data in a prospective study of pregnant women. When pregnant women were asked about recent alcohol consumption during their prenatal visits, days with missing data were an inherent consequence of the assessment methodology, and missing data were more common among recent drinkers. Thus, data were not missing at random. We implemented an extension of the k-NN algorithm that accounts for the absence of a ‘typical/classic’ reference group, i.e., training data with no missing days. To our knowledge, the present report is the first to describe this method for imputing missing alcohol consumption data in a longitudinal study of pregnant women. Validation of our approach showed high agreement between actual and predicted drinking values.
There is a paucity of studies addressing the potential bias introduced by missing data, as well as a lack of methodological tools to test the validity of such studies, in alcohol and drug use research 16. Published work has not yet reported the performance of any machine learning method for imputing missing alcohol data. In a simulated dataset, Hallgren et al. compared imputation methods including complete case analysis, last observation carried forward, worst-case scenarios (missing equals any drinking, or heavy drinking), multiple imputation (MI), and full information maximum likelihood (FIML), and concluded that MI and FIML yielded the least biased estimates 17,18. A recent study by Grittner et al., based on a longitudinal study in Denmark with five alcohol measurements over a period of five years, also found that MI produced the least bias 19. However, all methods in that study, including MI, underestimated the actual drinking level. In addition, MI models were originally recommended for imputing a single value per subject 20. Imputing irregularly spaced missing longitudinal data, as in PASS, would require complex extensions of MI 21, and applying MI to such a large dataset is computationally intensive. Despite the most recent advances in the field (see the Single Center Imputation from Multiple Chained Equation (SICE) approach)22, applying such methods to impute daily-level drinking (and other substance use) data appears impractical at present.
There are several advantages to using a non-parametric algorithm such as k-NN for imputing missing data. Most standard software packages rely on the assumption that multivariate data are normally distributed, so imputing repeated longitudinal data is challenging in most software options 21. In the PASS dataset, alcohol data were collected at the daily level, resulting in a high total volume of both data per participant and associated missing data. In the general population, alcohol consumption in pregnancy is highly skewed, with the majority of drinking concentrated in the first trimester. We observed this pattern in PASS; however, many study subjects also showed a gradually decreasing drinking pattern. In such scenarios, a non-parametric method such as k-NN has the advantage of making no distributional assumption.
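As a toy illustration of why normal-theory imputation fits such data poorly, the sketch below (entirely hypothetical values, not PASS data) applies mean imputation to a zero-inflated daily drinking series: every imputed day receives one small positive value, so the dominant zero mode of the true distribution is erased.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical zero-inflated daily drinks: roughly 90% abstinent days,
# so the distribution is highly skewed rather than normal.
truth = np.where(rng.random(10_000) < 0.1,
                 rng.integers(1, 6, 10_000), 0).astype(float)

miss = rng.random(truth.size) < 0.2        # hold out 20% of days
observed = truth.copy()
observed[miss] = np.nan

# Mean imputation (a stand-in for normal-theory methods) fills every
# missing day with the same small positive value, erasing the zero mode.
mean_fill = np.nanmean(observed)
print(f"imputed value for every missing day: {mean_fill:.2f}")
print(f"true share of zero days among held-out: {np.mean(truth[miss] == 0):.2f}")
```

A neighbor-based method that copies values from subjects with similar observed patterns avoids this distortion, since donated values come from the same skewed distribution.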
The sample size required to achieve reliable performance of k-NN imputation depends on the variability of the data being imputed: the higher the variability in the sample, the more observations are needed to draw inferences from the data. The choice of the number of neighbors (K) depends on the nature of the problem under investigation, the available data, and the goals of downstream analyses. On average, a higher number of neighbors yields greater prediction accuracy, but at the cost of significantly inflated standard deviations23. In most scenarios, a smaller K is a good compromise between performance and preservation of the original distribution of the data, although an overly restricted number of neighbors may fail to identify the observations most similar to the one under consideration. It is therefore advisable to derive goodness-of-fit measures to inform the choice of the optimal number of neighbors. The computational load of neighbor searching and of storing the training set must also be taken into consideration24. While it is not possible to indicate the optimal number of neighbors for a given dataset a priori without conducting a sensitivity analysis, in the context of our work we found K=5 to be a reasonable trade-off between the RMSE for drinking and non-drinking segments. Similar values of K were also reported in prior studies25,26.
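The kind of sensitivity analysis described above can be sketched as follows. This is a minimal hypothetical illustration, not the study's implementation: it uses synthetic zero-inflated data, a plain squared-distance on shared observed days, and donor averaging. Known values are artificially held out, re-imputed for several candidate K, and the hold-out RMSE for each K is inspected.

```python
import numpy as np

def knn_impute(observed, k):
    """Impute each subject's missing days as the mean of the k subjects
    whose observed drinking patterns are closest (squared distance on
    the days both subjects have observed)."""
    out, n = observed.copy(), observed.shape[0]
    for i in range(n):
        miss = np.isnan(observed[i])
        if not miss.any():
            continue
        dists = np.full(n, np.inf)
        for j in range(n):
            both = ~np.isnan(observed[i]) & ~np.isnan(observed[j])
            if j != i and both.sum() >= 5:
                dists[j] = np.mean((observed[i, both] - observed[j, both]) ** 2)
        pool = observed[np.argsort(dists)[:k]]        # k nearest donors
        cnt = np.sum(~np.isnan(pool), axis=0)
        fill = np.where(cnt > 0, np.nansum(pool, axis=0) / np.maximum(cnt, 1), 0.0)
        out[i, miss] = fill[miss]
    return out

# Synthetic zero-inflated daily data with an artificial 20% hold-out.
rng = np.random.default_rng(1)
truth = np.where(rng.random((150, 30)) < 0.1,
                 rng.integers(1, 6, (150, 30)).astype(float), 0.0)
mask = rng.random(truth.shape) < 0.2
observed = truth.copy()
observed[mask] = np.nan

# Hold-out RMSE for each candidate K; the minimizing K guides the choice,
# weighed against distribution preservation and computational cost.
for k in (1, 3, 5, 10, 25):
    imputed = knn_impute(observed, k)
    rmse = np.sqrt(np.mean((imputed[mask] - truth[mask]) ** 2))
    print(f"K={k:2d}  RMSE={rmse:.3f}")
```

In practice the RMSE curve typically flattens as K grows, while larger K smooths imputed values toward the mean; inspecting both the error and the imputed distribution, as described above, guides the final choice.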
Recent studies demonstrated that RMSE can lead to incorrect inferences when used to evaluate the distributional accuracy of imputation methods27,28. In our work, RMSE was used solely to identify the optimal number of neighbors. To evaluate imputation performance, we derived metrics widely accepted in the fields of machine learning and artificial intelligence. As an example, the confusion matrix in Figure 5 shows the imputation accuracy for the binary drinker/non-drinker classification, along with the mean absolute difference between predicted and actual daily drinking values and its confidence interval. Our choice of imputation metric was dictated by the type of data to be imputed and our downstream analysis goal. While most prior studies reported imputation of alcohol data as a binary classification (yes/no), we imputed daily drinking data (of 3.2 million person-days, data were missing for 0.36 million). It would therefore be impractical, and rather misleading, to validate the imputation by estimating the association between daily-level data and a known outcome (e.g., birth weight). The imputed data we derived are suitable for further processing, such as cluster analyses29, for which the evaluation metrics we used are adequate.
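The evaluation metrics mentioned above can be computed as in the sketch below. The actual and imputed arrays here are simulated stand-ins (not the study's validation data), and the bootstrap confidence interval is one common way to obtain the interval around the mean absolute difference; the study's exact procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical actual vs. imputed daily drink counts on validation days:
# imputation is perturbed by +/- 1 drink on about 15% of days.
actual = np.where(rng.random(5000) < 0.1, rng.integers(1, 6, 5000), 0).astype(float)
noise = rng.integers(-1, 2, 5000) * (rng.random(5000) < 0.15)
imputed = np.clip(actual + noise, 0, None)

# 2x2 confusion matrix for the binary drinker / non-drinker classification.
a, p = actual > 0, imputed > 0
conf = np.array([[np.sum(~a & ~p), np.sum(~a & p)],
                 [np.sum(a & ~p),  np.sum(a & p)]])
accuracy = np.trace(conf) / conf.sum()

# Mean absolute difference with a 95% percentile bootstrap interval.
abs_diff = np.abs(imputed - actual)
boot = [np.mean(rng.choice(abs_diff, abs_diff.size)) for _ in range(1000)]
lo, hi = np.percentile(boot, [2.5, 97.5])

print("confusion matrix:\n", conf)
print(f"accuracy={accuracy:.3f}  MAD={abs_diff.mean():.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

The confusion matrix summarizes binary classification quality, while the mean absolute difference captures error on the continuous daily drink counts; together they address both facets of the imputed data.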
The k-NN algorithm is increasingly used to impute missing data in research involving high-volume data, such as genetics and metabolomics studies 30,31. In several recent reports, k-NN produced the smallest imputation error compared with methods such as mean and median imputation, Bayesian linear regression, and the K-Means and K-Medoids clustering algorithms 32,33. However, some studies reported that simpler methods such as mean or median replacement were as adequate as k-NN when imputation was followed by clustering of genetic data 34. On the other hand, some have reported slightly better performance of random forest over k-NN for imputing metabolomics data 35. Another study noted improved performance of k-NN when additional information, such as SES and demographic data, was included in the prediction model 36. We used cosine distance to measure the similarity between the drinking patterns of two subjects. Chomboon et al. evaluated 11 distance measures and showed that several other distance measures also perform adequately37; future studies could evaluate the performance of multiple distance measures in imputing alcohol data. The validity and accuracy of imputation will likely vary with the data type, data structure, mechanism of missingness, amount of missing data, and choice of downstream analyses. Future studies are therefore needed to evaluate the performance of different machine learning algorithms for imputing alcohol consumption data.
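To illustrate the distance choice discussed above, the hypothetical sketch below contrasts cosine distance with Euclidean distance on two toy drinking vectors: cosine distance depends only on the *pattern* of drinking days, not on the per-occasion volume, whereas Euclidean distance penalizes the volume gap.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity between two daily drinking vectors."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0 or nv == 0:
        return 1.0   # convention: treat an all-zero pattern as maximally dissimilar
    return 1.0 - float(u @ v) / (nu * nv)

# Two hypothetical subjects who drink on the same days but at different
# volumes (one vs. three drinks per occasion), and one opposite pattern.
light = np.array([1., 0., 0., 1., 0., 1., 0.])
heavy = 3.0 * light
other = np.array([0., 1., 1., 0., 1., 0., 1.])

print(cosine_distance(light, heavy))   # 0.0: identical pattern, any scale
print(cosine_distance(light, other))   # 1.0: no shared drinking days
print(np.linalg.norm(light - heavy))   # Euclidean penalizes the volume gap
```

This scale-invariance is why a cosine-based neighbor search groups subjects by when they drink rather than by how much, which may or may not be desirable depending on the downstream analysis; evaluating alternative measures, as noted above, is a natural extension.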
In this paper, we provide a comprehensive description of imputing, with high accuracy, prenatal alcohol data collected using the timeline follow-back method with k-NN. Data collection methods such as the timeline follow-back method38 and food frequency questionnaires39, which collect extensive consumption data, are prone to informative missingness. The methodological details presented in this paper are highly relevant to various research areas, including substance use research, that suffer from missing data in longitudinal studies.