The K Nearest Neighbor Algorithm for Imputation of Missing Longitudinal Prenatal Alcohol Data

doi:10.21203/rs.3.rs-32456/v2

Research Article

This work is licensed under a CC BY 4.0 License

Version 2

Abstract

Background: Missing data are a source of bias in epidemiologic studies. This is particularly problematic in alcohol research, where data missingness is linked to drinking behavior.

Methods: The Safe Passage Study was a prospective investigation of prenatal drinking and fetal/infant outcomes (n=11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Followback method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing data using a machine learning algorithm, K Nearest Neighbor (K-NN). K-NN imputes missing values for a participant using data from the participants closest to that participant. Imputed values were weighted by the distances from the nearest neighbors and matched for day of the week. Validation was done on data randomly deleted for 5-15 consecutive days.

Results: Data from 5 nearest neighbors and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from participants with no missing days in the first trimester, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within +/-1 drink/day of the actual values.

Conclusions: K-NN can be used to impute missing data in longitudinal studies of alcohol use during pregnancy with high accuracy.

Mathematical Physics

Developmental Neuroscience

Epidemiology

K Nearest Neighbor

Machine Learning

Data Missingness

Data Imputation

Prenatal Alcohol Consumption

Longitudinal Alcohol Consumption

Source of Support

This research was supported by grants UH3OD023279, U01HD055154, U01HD045935, U01HD055155, and U01AA016501, issued by the Office of the Director, National Institutes of Health; the National Institute on Alcohol Abuse and Alcoholism; the Eunice Kennedy Shriver National Institute of Child Health and Human Development; and the National Institute on Deafness and Other Communication Disorders. The opinions expressed in this paper are those of the authors and do not necessarily represent the official views of the National Institutes of Health, the Eunice Kennedy Shriver National Institute of Child Health and Human Development, the National Institute on Alcohol Abuse and Alcoholism, or the National Institute on Deafness and Other Communication Disorders.

Introduction

Accurate assessment of the timing, frequency, and quantity of prenatal alcohol exposure (PAE) in longitudinal research studies is necessary for obtaining unbiased estimates of its effects on fetal and infant outcomes. Despite the importance of PAE from a public health point of view, there are currently no robust biomarkers for assessing the timing and amount of alcohol exposure during pregnancy. Thus, we often remain reliant on maternal self-report of intake. Aside from issues associated with the accuracy of self-report, there are other methodological challenges in measuring alcohol exposure in longitudinal studies¹,². Recording daily intake, while providing a temporally complete set of values, involves significant participant burden and is likely to affect consumption behavior³. As a consequence, in many studies, alcohol consumption data are sampled at various times throughout pregnancy⁴. However, even when data for the specific collection time-points are complete, information about intake during the intervals between samples is frequently missing. Addressing this missing data problem is critical when the exposure metrics of interest are both the timing and the amount of drinking in pregnancy⁵.

The impact of missing data on the validity of estimates largely depends on the reasons the data are missing⁶. For example, pregnant women of low socioeconomic status (SES) are more likely to access antenatal care late in pregnancy, enroll late in research studies, and, therefore, have more missing data early in pregnancy⁷. This is problematic because SES is an important determinant of drinking behavior during pregnancy⁸. In addition, women often modify their consumption behavior following pregnancy recognition, which can happen at varying times during the first months of pregnancy. While some women stop or reduce drinking immediately upon pregnancy recognition, some heavy drinkers continue to binge in the first trimester or continue heavy drinking throughout the pregnancy⁵. Measures computed irrespective of the presence of missing data, such as the number of drinks consumed only on drinking days, may also provide biased overall estimates depending on when participants are interviewed. Accordingly, new approaches for managing the missing data problem are needed.

The Safe Passage Study, conducted by the Prenatal Alcohol and SIDS and Stillbirth Network (PASS), was a prospective investigation of the effects of alcohol exposure on multiple fetal and infant outcomes in Cape Town, South Africa and the Northern Plains, USA⁹. In this study, alcohol data were collected using a modification of the Timeline Followback Method (TLFB)¹⁰, in which mothers recorded drinking data for their last known drinking day and then for the 30 days prior. While this method was deemed the best self-report system available, the approach necessarily generated a variable amount of missing data.

Here, our goal was to impute the drinking values on missing days using a machine learning algorithm called k-nearest neighbor (K-NN). K-NN imputes missing values using pattern recognition without any distributional assumption about the underlying data¹¹. The K-NN algorithm has been used for imputation of missing data in several research studies in the healthcare field¹²,¹³. In this paper, we provide the methodological details of the specific application of the K-NN method to PASS exposure data and the validation of these results.

Methods

The Safe Passage Study:

PASS was a prospective study of a cohort of pregnant women and their infants evaluating the role of prenatal alcohol exposure in the incidence of adverse pregnancy outcomes including stillbirth, sudden infant death syndrome (SIDS), and fetal alcohol spectrum disorders (FASDs) in the surviving children. Between August 2007 and January 2015, 11,892 pregnant women were enrolled from antenatal clinics in the Northern Plains, USA and Cape Town, South Africa. Women were eligible to participate in the study if they were pregnant with one or two fetuses, were aged 16 years or older, were at gestational age 6 weeks or later at recruitment, and spoke English or Afrikaans. Women were followed throughout the pregnancy and for 1 year postnatally. Data on socioeconomic status, demographics, obstetric and medical history, and periconceptional drinking and smoking were collected at the enrollment interview. Information on subsequent drinking during pregnancy was updated at study visits following enrollment. Written informed consent was obtained from all participants. Ethical approval was obtained for each participating PASS network site from their institutional review boards, including Stellenbosch University, Sanford Health, the Indian Health Service, and from participating Tribal Nations. All data collection and analyses were performed in accordance with the guidelines of the participating institutions' ethical review boards. The research was also overseen by the PASS Network Steering Committee as well as an external Advisory and Safety Monitoring Board.

Alcohol data collection method and missing data

Alcohol exposure data were collected using a modified, validated TLFB¹⁰, which required participants to report details of their drinking on each day within ± 15 days of the last menstrual period (LMP) and, at each study visit, for the thirty days prior to the last known drinking day. Data were collected on the types and number of drinks, the size of the containers, the amount of ice in the drink, how many people shared the drink, and the duration of the drinking episodes¹⁰. These data were then used to estimate the total amount of alcohol consumed and the number of standard drinks on each reported drinking day¹⁴. Data on drinking were collected during 1–4 prenatal study visits and 1 postpartum visit.

Due to the nature of the modified TLFB data collection design, the number of days with missing data varied by participant as a function of the time of enrollment and the number of subsequent visits. The number of days with missing drinking information also varied for each participant depending on how recently they had been drinking. Figure 1 shows examples of how such variation emerged during the time period between LMP and the recruitment visit depending on when the last drinking day occurred. Participants who did not drink, or whose last drinking day was prior to their LMP, had no missing data (Fig. 1, panel a). Participants who drank but quit drinking within 30 days of the last collection period had fewer or no missing days (Fig. 1, panel b). Participants who continued to drink, and who reported drinking information for the 30 days closest to the interview date, had missing information prior to the 30-day period of reported drinking (Fig. 1, panel c). In this example, if Subject Z drank often, and possibly at a higher volume, she would have a greater number of missing days than women who drank less often. Thus, a summation of drinks over the reported days will be less than the actual consumption, and analyses using this exposure metric will be biased.

The KNN algorithm:

k-NN is a non-parametric machine learning algorithm that can be used to impute the missing drinking information of a subject based on the information provided by other observations in a given database. Figure 2 displays the imputation of missing data for subject p based on the drinking information of the subjects whose drinking patterns are most similar to that of p. Similarity in the drinking patterns of two subjects is measured using their cosine distance. In this hypothetical example, there are three subjects (q, r and s) for whom estimates of alcohol consumption were collected on three different days during pregnancy. For subject p, information is missing for the third day. The nearest neighbor of subject p is subject q. The angle between p′O and q′O is zero, which means that p and q have exactly the same drinking pattern, as they both consumed three times as many drinks on day 1 as on day 2. The next nearest neighbor of subject p is subject r, as the angle between them is small. In practice, calculating the angle itself is computationally costly, so we use its cosine as a proxy. Once the k nearest neighbors of p are identified, the weighted average of the drinking data of these neighbors for the day on which p's drinking data are missing is taken as the best estimate of the missing value. The weighted average is used to ensure that the neighbors nearer to p have more influence on the predicted value than the ones further away from it. We also scaled the imputed values to the individual's consumption level. In this example, the scaling adjustment is needed because, although p and r have similar drinking patterns, p is a heavier drinker than r. Details of the computation of cosine similarity and the scaling adjustment are described in appendix 1.
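The idea in this paragraph can be illustrated with a minimal Python sketch. It is not the authors' implementation, whose details are in appendix 1. The function names, the use of the cosine similarity itself as the neighbor weight, and the scaling by the ratio of total drinks on shared days are all assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b, using only days observed in both."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return 0.0
    x, y = a[both], b[both]
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0

def knn_impute_day(target, others, day, k=5):
    """Estimate target[day] from the k rows of `others` most similar to target."""
    # Only subjects with an observed value on the missing day can serve as neighbors.
    candidates = [row for row in others if not np.isnan(row[day])]
    if not candidates:
        return float("nan")
    sims = np.array([cosine_similarity(target, row) for row in candidates])
    nearest = np.argsort(-sims)[:k]
    estimates = []
    for idx in nearest:
        row = candidates[idx]
        both = ~np.isnan(target) & ~np.isnan(row)
        # Assumed scaling: ratio of the two subjects' total drinks on shared days.
        neighbor_total = row[both].sum()
        scale = target[both].sum() / neighbor_total if neighbor_total > 0 else 1.0
        estimates.append(scale * row[day])
    weights = sims[nearest]
    if weights.sum() <= 0:
        return float(np.mean(estimates))  # fall back to an unweighted mean
    return float(np.average(estimates, weights=weights))

# Hypothetical data in the spirit of Figure 2: p is missing day 3.
p = np.array([6.0, 2.0, np.nan])
q = np.array([3.0, 1.0, 2.0])   # same pattern as p at half the volume
r = np.array([3.0, 2.0, 1.0])
s = np.array([0.0, 4.0, 1.0])
print(knn_impute_day(p, [q, r, s], day=2, k=1))  # about 4: q's value doubled
```

With k=1 only q is used, and because p drinks twice as much as q on the shared days, q's day-3 value is doubled. With larger k the less similar subjects contribute with smaller weights.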

Data preparation:

We first converted the data to a single record (row) per person, in which drinking values were separate variables (columns), one variable for each day starting from day − 15 (2 weeks prior to LMP) and ending at day 310 (the maximum possible pregnancy length). We used the number of days between a fixed date before the start of the study (Saturday, January 1, 2000) and the beginning of pregnancy (i.e., day − 15) to find the day of the week on which the pregnancy started. This was then used to temporally align each pair of subjects prior to computation of the cosine distances. For example, when computing the nearest neighbors of a participant p whose pregnancy started on a Wednesday, if we encountered another participant q whose pregnancy started on a Monday, we aligned day − 15 of p (a Wednesday) with day − 13 of q (another Wednesday) and ignored the first two days (days − 15 and − 14) of q and the last two days (days 309 and 310) of p. The rationale behind this alignment is that drinking behavior often varies by the day of the week¹⁵. We also Winsorized (capped) the outlier drinking values at 3 SD (21 for the South Africa site and 28 for the Northern Plains sites) to reduce the impact of outliers on the imputed values. Because the pattern of drinking cannot be established in subjects with data missing for a large number of days, we excluded subjects with more than 200 missing days and subjects who did not have any data in the first trimester.
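The weekday alignment and the capping of outliers can be sketched as follows. This is an illustrative sketch rather than the study code: the function names are hypothetical, and the cap is assumed to be the mean plus 3 SD of the observed values, which the text does not state explicitly.

```python
from datetime import date
import numpy as np

REFERENCE_SATURDAY = date(2000, 1, 1)  # the fixed Saturday named in the text

def weekday_index(start_date):
    """Days since the most recent Saturday (0 = Saturday, ..., 6 = Friday)."""
    return (start_date - REFERENCE_SATURDAY).days % 7

def align(p_row, q_row, p_start, q_start):
    """Trim two equal-length daily series so matching positions share a weekday."""
    shift = (weekday_index(p_start) - weekday_index(q_start)) % 7
    if shift == 0:
        return p_row, q_row
    # q skips its first `shift` days so that its first kept day falls on the
    # weekday on which p starts; p drops the same number of days at the end.
    return p_row[:len(p_row) - shift], q_row[shift:]

def winsorize_upper(values, n_sd=3.0):
    """Cap values at mean + n_sd * SD of the observed (non-missing) values."""
    cap = np.nanmean(values) + n_sd * np.nanstd(values)
    return np.minimum(values, cap)  # missing values (NaN) stay missing

# p starts on Wednesday 5 Jan 2000, q on Monday 3 Jan 2000:
# q loses its first two days and p its last two, as in the example above.
p_days = np.arange(326, dtype=float)
q_days = np.arange(326, dtype=float)
p_aligned, q_aligned = align(p_days, q_days, date(2000, 1, 5), date(2000, 1, 3))
```

When the shift is larger than three days the same weekday match could be reached by trimming p instead of q. This sketch always trims the start of q, which is one possible convention.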

Assessment of performance/validation

We validated our approach by comparing actual values from a subset of subjects with no missing data in trimester 1 with imputed values obtained after random deletion of data for 5 to 15 consecutive days. The first trimester was selected for validation because the proportion of women drinking and the magnitude of their drinking are highest in trimester 1, particularly for the days before pregnancy recognition. To evaluate imputation performance, we computed the root mean squared error (RMSE) for the predicted drinking values in the deleted segments as follows,

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 } \]

where n is the length of a segment and, for 1 ≤ i ≤ n, y_i and ŷ_i are the actual and predicted values, respectively, of the ith entry of the segment. We then calculated the overall prediction accuracy of drinking status as the proportion of accurately classified segments and plotted it in a confusion matrix (Fig. 5, panel b). In addition, we calculated absolute differences between actual and predicted values for drinking and non-drinking days separately.
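The deletion step and the RMSE computation can be sketched in Python as below. The helper names and the use of NumPy's random generator are assumptions for illustration. The study's own sampling procedure may differ.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def delete_random_segment(row, rng, min_len=5, max_len=15):
    """Return (masked copy, slice): one random run of consecutive days set to NaN."""
    length = int(rng.integers(min_len, max_len + 1))     # upper bound is exclusive
    start = int(rng.integers(0, len(row) - length + 1))
    masked = np.array(row, dtype=float)
    masked[start:start + length] = np.nan
    return masked, slice(start, start + length)

# Hypothetical complete first-trimester record of 90 days.
rng = np.random.default_rng(0)
complete = rng.poisson(1.0, size=90).astype(float)
masked, segment = delete_random_segment(complete, rng)
# After imputing masked[segment] with any method, compare with the truth:
# error = rmse(complete[segment], imputed[segment])
```

Each deleted segment yields one RMSE value, so repeating the deletion over many subjects gives a distribution of errors.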

Results

Description of missing data

Participants contributed a total of 3.2 million person-days of observation in the study, of which 0.36 million (11.4%) person-days were missing. Based on the data collected using the TLFB method, about 45% of the participants (n=5396) had alcohol use data for every single day of their pregnancy, while the remaining 55% (n=6492) had at least 1 day of alcohol-use data missing. Among the study participants, 62% (n=7119) were drinkers, i.e., consumed at least 1 drink during pregnancy. Figure 3 shows the distribution of missing days per participant by study site. Overall, the Northern Plains sites had fewer missing data, with over 50% of the participants having 30 or fewer days of missing data (Figure 3). Most of the missing data in the South Africa site are from the early trimesters, which largely reflects later enrollment at that site, while the majority of missing data in the Northern Plains sites are in the third trimester (data not shown).

Application of k-NN

Length of reference segment:

The largest possible reference segment in the PASS data set is 324 days, the maximum length of the pregnancy (310 days) plus 2 weeks before pregnancy. However, as mentioned in a previous section, women with complete data were more likely to be nondrinkers or light drinkers; hence, using them as the reference could produce an underestimate of true drinking values. The tradeoff in selecting the segment size is that smaller segments (e.g., 7 days) allow more segments to be included as reference, but the smaller the segment becomes, the less accurately the algorithm characterizes specific patterns of drinking. We determined that reducing the segment size below 55 days did not significantly increase the number of available reference segments (Figure 4). Thus, a segment size of approximately 2 months (55 days) retained the majority of the subjects in the reference pool without significantly diminishing the ability to identify their drinking patterns.
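One way to explore this tradeoff is to count, for each candidate segment length, how many windows of consecutive days have no missing values. The sketch below counts overlapping windows across all subjects. Whether the study counted windows or subjects is not stated, so this definition and the function name are assumptions.

```python
import numpy as np

def count_complete_segments(data, seg_len):
    """Count windows of seg_len consecutive non-missing days over all rows."""
    observed = ~np.isnan(data)
    total = 0
    for row in observed:
        run = 0  # length of the current run of consecutive observed days
        for is_observed in row:
            run = run + 1 if is_observed else 0
            if run >= seg_len:
                total += 1  # a complete window ends on this day
    return total

# Hypothetical toy data: two subjects, five days, one missing day.
toy = np.array([[1.0, 1.0, 1.0, np.nan, 1.0],
                [1.0, 1.0, 1.0, 1.0, 1.0]])
counts = {length: count_complete_segments(toy, length) for length in (2, 3, 5)}
```

Plotting such counts against the segment length gives a curve of the kind described for Figure 4, from which a length can be chosen where further shortening adds few reference segments.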

Number of neighbors, K

To identify the optimal number of neighbors for imputation, we varied the number of neighbors k from 1 to 10. Figure 5 shows the distribution of root mean square errors (RMSE) as a function of k with both study sites combined. For the prediction of nondrinking segments, k=1 provided the lowest RMSE (panel a), whereas using k>1 (multiple neighbors) provided lower RMSE for the prediction of drinking segments. Although the mean RMSE in the drinking segments decreased as the value of k increased, the mean RMSE did not decrease substantially beyond k=5. Based on the relative performance in classification accuracy (Figure 5: panel b) in both sites, we concluded that k=5 provided reasonable accuracy for both drinking and nondrinking days. We found that the k-NN algorithm made exact predictions of drinking status for 76% of drinking segments in the combined sample. The algorithm predicted nondrinking status accurately in 74% and 58% of the deleted segments in South Africa and the Northern Plains, respectively (data not shown). Using k=5, the approach predicted nondrinking segments within +/- 1 drink for 80.5% of deleted segments in South Africa and 70.6% in the Northern Plains (Figure 5: panel c).
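The two agreement measures reported here, classification of drinking status and agreement within one drink, can be computed as in the following sketch. The threshold of 0.5 drinks used to classify a predicted value as drinking is an assumption, since imputed values are weighted averages and the text does not say how they were dichotomized. The function names are hypothetical.

```python
import numpy as np

def drinking_status_confusion(actual, predicted, threshold=0.5):
    """2x2 counts: rows = actual status, columns = predicted status.

    Index 0 is non-drinking and index 1 is drinking.
    """
    a = (np.asarray(actual, dtype=float) >= threshold).astype(int)
    p = (np.asarray(predicted, dtype=float) >= threshold).astype(int)
    matrix = np.zeros((2, 2), dtype=int)
    np.add.at(matrix, (a, p), 1)  # accumulate one count per (actual, predicted) pair
    return matrix

def share_within(actual, predicted, tolerance=1.0):
    """Proportion of predictions within `tolerance` drinks of the actual value."""
    diff = np.abs(np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float))
    return float(np.mean(diff <= tolerance))

# Hypothetical actual and imputed values for four deleted days.
actual = [0.0, 0.0, 3.0, 2.0]
imputed = [0.0, 1.2, 2.5, 0.2]
matrix = drinking_status_confusion(actual, imputed)
accuracy = np.trace(matrix) / matrix.sum()
within_one = share_within(actual, imputed)
```

Repeating these calculations for each k from 1 to 10 alongside the mean RMSE gives the quantities needed to compare values of k.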

Average drinking after imputation:

Figure 6 shows the mean number of drinks per person by trimester before and after imputation. Following imputation, the mean number of drinks in South Africa increased by an average of 2 drinks in the first trimester, while the increase for the Northern Plains sites was just below 1 drink in the first trimester. The magnitude of the increase in mean drinks was therefore higher in South Africa than in the Northern Plains. The Northern Plains sites had fewer missing data than the South African site. Consequently, although many individual drinking values were changed, imputation had little effect on the average drinking values in the Northern Plains sites.

Discussion

The objective of this report was to describe the application of a machine learning algorithm to impute missing daily alcohol consumption data in a prospective study among pregnant women. When pregnant women were asked about recent alcohol consumption during their prenatal visits, days of missing data were an inherent consequence of the assessment methodology, and there were more missing data among recent drinkers. Thus, the data were not missing at random. We implemented an extension of the kNN algorithm that accounted for the absence of a 'typical/classic' reference group, i.e., training data with no missing days. To our knowledge, the present report is the first to describe this method for imputing missing alcohol consumption data in a longitudinal study among pregnant women. Validation of our approach showed high agreement between actual and predicted drinking values.

There is a paucity of studies on missing data techniques and their statistical validity in alcohol and drug use research¹⁶. Published work has not yet reported the performance of any machine learning method for imputation of missing alcohol data. In a simulated dataset, Hallgren et al. compared methods of handling missing data, including complete case analysis, last observation carried forward, the worst-case scenario in which missing equals any drinking or heavy drinking, multiple imputation (MI), and full information maximum likelihood (FIML), and concluded that MI and FIML yielded less biased estimates¹⁷,¹⁸. A study by Grittner et al. also found that MI produced the least bias, based on their work in a longitudinal study in Denmark with five alcohol measurements over a period of five years¹⁹. However, all methods in that study, including MI, underestimated the actual drinking level. In addition, MI models were originally recommended for imputation of a single value per subject²⁰. To impute irregularly spaced missing longitudinal data as in PASS, complex extensions of MI would be needed²¹.

There are several advantages to using a non-parametric algorithm such as kNN for imputation of missing data. The majority of standard software packages rely on the assumption that multivariate data are normally distributed; therefore, imputation of repeated longitudinal data is challenging in most software options²¹. In PASS, alcohol data were collected at the daily level, resulting in a high volume of both data and missing data per participant. In the general population, alcohol consumption in pregnancy is highly skewed, with the majority of the drinking concentrated in the first trimester. We observed this pattern in PASS; however, there was also a gradually decreasing drinking pattern among many study subjects. In such scenarios, a nonparametric method such as kNN has the advantage of not making a distributional assumption.

The kNN algorithm is increasingly used to impute missing data in research with high-volume data, such as genetics and metabolomics studies²²,²³. In several recent reports, the kNN algorithm was shown to produce the smallest imputation error compared to methods such as mean and median imputation, Bayesian linear regression, and the K-Means and K-Medoids clustering algorithms²⁴,²⁵. However, some studies reported that simpler methods such as mean or median replacement were as adequate as methods like kNN when imputation was followed by clustering of genetic data²⁶. On the other hand, some have reported slightly better performance of random forest over kNN for imputing metabolomics data²⁷. Another study noted that the performance of kNN improved when additional information such as SES and demographic data was included in the prediction model²⁸. We used cosine distance to measure the similarity in the drinking patterns of two subjects. Chomboon et al. evaluated 11 distance measures and showed that several other distance measures perform adequately²⁹. Future studies could evaluate the performance of multiple distance measures in imputing alcohol data. The validity and accuracy of imputation methods will likely vary with the data type, data structure, mechanism of missingness, and amount of missing data. Therefore, future studies need to evaluate the performance of different machine learning algorithms for imputing alcohol consumption data.

Abbreviations

K-NN: K Nearest Neighbor
LMP: Last menstrual period
MI: Multiple imputation
PAE: Prenatal alcohol exposure
PASS: Prenatal Alcohol and SIDS and Stillbirth Network
SES: Socioeconomic status
SIDS: Sudden infant death syndrome
TLFB: Timeline Followback Method
FIML: Full information maximum likelihood
RMSE: Root mean square error

Declarations

Ethics approval and consent to participate

Ethical approval was obtained from Stellenbosch University, Sanford Health, the Indian Health Service and from participating Tribal Nations.

IRB Protocol information for the PASS study:

Health Research Ethics Committee at Stellenbosch University (protocol # N0610210)

Institutional Review Board at Sanford Health (protocol #CR00000266)

Institutional Review Board at New York State Psychiatric Institute (protocol # 5338)

Consent for Publication

Written informed consent was obtained from all participants.

Availability of data and material

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Competing interests

The authors declare that they have no competing interests.

Author contribution

Ayesha Sania conceptualized and conducted the data analysis, interpreted the results, and wrote the first draft of the manuscript. Nicolò Pini and Michael M. Myers participated in the data analyses and contributed to the manuscript writing. Lauren C. Shuffrey and Maristella Lucchini contributed to the interpretation of the results and the manuscript writing. Hein J. Odendaal and Morgan E. Nelson participated in study implementation and data collection and provided critical input on the manuscript. Amy J. Elliott and William P. Fifer are the principal investigators of the Safe Passage Study and contributed to the study design, implementation, analysis, and interpretation of the data. All authors read and approved the final manuscript.

Acknowledgements

The Timeline Follow Back questionnaire and related training were developed in collaboration with Joseph Jacobson, PhD, and Sandra Jacobson, PhD. We also acknowledge the contributions of Hannah Kinney, MD, Larry Burd, PhD, and Christopher Molteno, MD in designing and implementing the projects and data collection.

References

Dawson, D. A. Methodological issues in measuring alcohol use. Alcohol Res Health. 27, 18–29 (2003).
Feunekes, G. I., van 't Veer, P., van Staveren, W. A. & Kok, F. J. Alcohol intake assessment: the sober facts. Am J Epidemiol. 150, 105–112 (1999).
Buu, A. et al. Examining measurement reactivity in daily diary data on substance use: Results from a randomized experiment. Addict Behav. 102, 106198 https://doi.org/10.1016/j.addbeh.2019.106198 (2020).
McQuire, C. et al. Objective Measures of Prenatal Alcohol Exposure: A Systematic Review. Pediatrics. 138, https://doi.org/10.1542/peds.2016-0517 (2016).
O'Keeffe, L. M. et al. Prevalence and predictors of alcohol use during pregnancy: findings from international multicentre cohort studies. BMJ Open. 5, e006323 https://doi.org/10.1136/bmjopen-2014-006323 (2015).
Rubin, D. Inference and missing data. Biometrika. 63 (3), 581–592 (1976).
Simkhada, B., Teijlingen, E. R., Porter, M. & Simkhada, P. Factors affecting the utilization of antenatal care in developing countries: systematic review of the literature. J Adv Nurs. 61, 244–260 https://doi.org/10.1111/j.1365-2648.2007.04532.x (2008).
Skagerstrom, J., Chang, G. & Nilsen, P. Predictors of drinking during pregnancy: a systematic review. J Womens Health (Larchmt). 20, 901–913 https://doi.org/10.1089/jwh.2010.2216 (2011).
Dukes, K. A. et al. The safe passage study: design, methods, recruitment, and follow-up approach. Paediatr Perinat Epidemiol. 28, 455–465 https://doi.org/10.1111/ppe.12136 (2014).
Dukes, K. et al. A modified Timeline Followback assessment to capture alcohol exposure in pregnant women: Application in the Safe Passage Study. Alcohol. 62, 17–27 https://doi.org/10.1016/j.alcohol.2017.02.174 (2017).
Cover, T. & Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theor. 13, 21–27 https://doi.org/10.1109/tit.1967.1053964 (1967).
Elliott, P. & Hawthorne, G. Imputing missing repeated measures data: how should we proceed? Aust N Z J Psychiatry. 39, 575–582 https://doi.org/10.1080/j.1440-1614.2005.01629.x (2005).
Waljee, A. K. et al. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open. 3, https://doi.org/10.1136/bmjopen-2013-002847 (2013).
Brick, J. Standardization of alcohol calculations in research. Alcohol Clin Exp Res. 30, 1276–1287 https://doi.org/10.1111/j.1530-0277.2006.00155.x (2006).
Room, R. et al. Times to drink: cross-cultural variations in drinking in the rhythm of the week. Int J Public Health. 57, 107–117 https://doi.org/10.1007/s00038-011-0259-3 (2012).
Grigsby, T. J. & McLawhorn, J. Missing Data Techniques and the Statistical Conclusion Validity of Survey-Based Alcohol and Drug Use Research Studies: A Review and Comment on Reproducibility. Journal of Drug Issues. 49, 44–56 https://doi.org/10.1177/0022042618795878 (2018).
Hallgren, K. A. & Witkiewitz, K. Missing data in alcohol clinical trials: a comparison of methods. Alcohol Clin Exp Res. 37, 2152–2160 https://doi.org/10.1111/acer.12205 (2013).
Hallgren, K. A. et al. Missing Data in Alcohol Clinical Trials with Binary Outcomes. Alcohol Clin Exp Res. 40, 1548–1557 https://doi.org/10.1111/acer.13106 (2016).
Grittner, U., Gmel, G., Ripatti, S., Bloomfield, K. & Wicki, M. Missing value imputation in longitudinal measures of alcohol consumption. Int J Methods Psychiatr Res. 20, 50–61 https://doi.org/10.1002/mpr.330 (2011).
Rubin, D. Multiple Imputation for Nonresponse in Surveys (Wiley, 1987).
Huque, M. H., Carlin, J. B., Simpson, J. A. & Lee, K. J. A comparison of multiple imputation methods for missing data in longitudinal studies. BMC Med Res Methodol. 18, 168 https://doi.org/10.1186/s12874-018-0615-6 (2018).
Liao, S. G. et al. Missing value imputation in high-dimensional phenomic data: imputable or not, and how? BMC Bioinformatics. 15, 346 https://doi.org/10.1186/s12859-014-0346-6 (2014).
Shah, J. S. et al. Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies. BMC Bioinformatics. 18, 114 https://doi.org/10.1186/s12859-017-1547-6 (2017).
Jadhav, A., Pramod, D. & Ramanathan, K. Comparison of Performance of Data Imputation Methods for Numeric Dataset. Applied Artificial Intelligence. 33, 913–933 https://doi.org/10.1080/08839514.2019.1637138 (2019).
Mahboob, T., Ijaz, A., Shahzad, A. & Kalsoom, M. in 2018 12th International Conference on Open Source Systems and Technologies (ICOSST). 76–81.
de Souto, M. C., Jaskowiak, P. A. & Costa, I. G. Impact of missing data imputation methods on gene expression clustering and classification. BMC Bioinformatics. 16, 64 https://doi.org/10.1186/s12859-015-0494-3 (2015).
Kokla, M., Virtanen, J., Kolehmainen, M., Paananen, J. & Hanhineva, K. Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study. BMC Bioinformatics. 20, 492 https://doi.org/10.1186/s12859-019-3110-0 (2019).
Schwender, H. Imputing Missing Genotypes with Weighted k Nearest Neighbors. Journal of Toxicology and Environmental Health, Part A. 75, 438–446 https://doi.org/10.1080/15287394.2012.674910 (2012).
Chomboon, K., Chujai, P., Teerarassammee, P., Kerdprasop, K. & Kerdprasop, N. An Empirical Study of Distance Metrics for k-Nearest Neighbor Algorithm. (2015).

Appendix1.docx
