The Safe Passage Study:
PASS was a prospective cohort study of pregnant women and their infants that evaluated the role of prenatal alcohol exposure in the incidence of adverse pregnancy outcomes, including stillbirth, sudden infant death syndrome (SIDS), and fetal alcohol spectrum disorders (FASDs) among the surviving children. Between August 2007 and January 2015, 11,892 pregnant women were enrolled from antenatal clinics in the Northern Plains, USA, and Cape Town, South Africa. Women were eligible to participate if they were pregnant with one or two fetuses, were aged 16 years or older, were at a gestational age of 6 weeks or later at recruitment, and spoke English or Afrikaans. Women were followed throughout pregnancy and for 1 year postnatally. Data on socioeconomic status, demographics, obstetric and medical history, and periconceptional drinking and smoking were collected at the enrollment interview. Information on subsequent drinking during pregnancy was updated at study visits following enrollment. Written informed consent was obtained from all participants. Ethical approval was obtained for each participating PASS Network site from its institutional review board, including those of Stellenbosch University, Sanford Health, and the Indian Health Service, and from participating Tribal Nations. All data collection and analyses were performed in accordance with the guidelines of the participating institutions' ethical review boards. The research was also overseen by the PASS Network Steering Committee and by an external Advisory and Safety Monitoring Board.
Alcohol data collection method and missing data
Alcohol exposure data were collected using a modified, validated timeline followback (TLFB) method [10], which required participants to report details of their drinking for each day within ±15 days of the last menstrual period (LMP) and, at each study visit, for the thirty days prior to the last known drinking day. Data were collected on the types and number of drinks, the size of the containers, the amount of ice in the drink, how many people shared the drink, and the duration of the drinking episodes [10]. These data were then used to estimate the total amount of alcohol consumed and the number of standard drinks on each reported drinking day [14]. Data on drinking were collected at 1–4 prenatal study visits and 1 postpartum visit.
Because of the modified TLFB data collection design, the number of days with missing data varied by participant as a function of the time of enrollment and the number of subsequent visits. It also varied with how recently each participant had been drinking. Figure 1 shows examples of how such variation emerged during the period between LMP and the recruitment visit, depending on when the last drinking day occurred. Participants who did not drink, or whose last drinking day was prior to their LMP, had no missing data (Fig. 1, panel a). Participants who drank but quit drinking within 30 days of the last collection period had few or no missing days (Fig. 1, panel b). Participants who continued to drink, and who reported drinking information for the 30 days closest to the interview date, had missing information for the days before that 30-day reporting period (Fig. 1, panel c). In this example, if Subject Z drank often, and possibly at a higher volume, she would have a greater number of missing days than a woman who drank less often. Thus, a simple summation of drinks over the reported days will underestimate actual consumption, and analyses using this exposure metric will be biased.
The k-NN algorithm:
k-nearest neighbors (k-NN) is a non-parametric machine learning algorithm that can be used to impute the missing drinking information of a subject from the information provided by other observations in a given database. Figure 2 displays the imputation of missing data for subject p based on the drinking information of the subjects whose drinking patterns are most similar to that of p. Similarity between the drinking patterns of two subjects is measured using their cosine distance. In this hypothetical example, there are three subjects (q, r and s) for whom estimates of alcohol consumption were collected on three different days during pregnancy. For subject p, information is missing for the third day. The nearest neighbor of subject p is subject q. The angle between p′O and q′O is zero, which means that p and q have exactly the same drinking pattern: both consumed three times more drinks on day 1 than on day 2. The next nearest neighbor of subject p is subject r, as the angle between them is small. In practice, calculating the angle itself is computationally costly, so we use its cosine as a good approximation. Once the k nearest neighbors of p are identified, the weighted average of these neighbors' drinking data for the day on which p's drinking data are missing is taken as the best estimate of the missing value. A weighted average is used to ensure that neighbors nearer to p have more influence on the predicted value than those further away. We also scaled the imputed values to the individual's consumption level. In this example, the scaling adjustment is needed because, although p and r have similar drinking patterns, p is a heavier drinker than r. Details of the computation of cosine similarity and the scaling adjustment are described in Appendix 1.
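The imputation step described above can be sketched in a few lines of Python. This is a minimal illustration only, not the procedure of Appendix 1: the function names, the use of cosine similarity itself as the neighbor weight, and the scaling factor (ratio of total drinks over the days both subjects reported) are all assumptions made for this example, as are the drink counts for the hypothetical subjects p, q, r and s.

```python
import math

def shared_days(a, b):
    """Indices of days observed (not None) in both records."""
    return [i for i, (x, y) in enumerate(zip(a, b))
            if x is not None and y is not None]

def cosine_similarity(a, b):
    """Cosine of the angle between two records over their shared days."""
    idx = shared_days(a, b)
    dot = sum(a[i] * b[i] for i in idx)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in idx))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in idx))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def impute_day(target, others, day, k=2):
    """Estimate target[day] from the k most similar records in `others`."""
    scored = []
    for row in others:
        if row[day] is None:
            continue  # this neighbor cannot inform the missing day
        idx = shared_days(target, row)
        row_total = sum(row[i] for i in idx)
        sim = cosine_similarity(target, row)
        if sim <= 0 or row_total == 0:
            continue
        # Assumed scaling: rescale the neighbor to the target's consumption level.
        scale = sum(target[i] for i in idx) / row_total
        scored.append((sim, scale * row[day]))
    if not scored:
        return None
    top = sorted(scored, key=lambda item: item[0], reverse=True)[:k]
    # Assumed weighting: similarity-weighted average of the scaled values.
    return sum(sim * value for sim, value in top) / sum(sim for sim, _ in top)

# Hypothetical drinks per day; None marks the missing third day of p.
p = [6, 2, None]
q = [3, 1, 2]    # same pattern as p (3:1), but half the volume
r = [3, 1.5, 1]  # similar pattern, lighter drinker
s = [1, 4, 0]    # dissimilar pattern
estimate = impute_day(p, [q, r, s], day=2, k=2)
```

With k = 2, only q and r contribute; s is the least similar subject and is ignored. Because p drinks twice as much as q on the shared days, q's day-3 value is doubled before averaging.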
Data preparation:
We first converted the data to a single record (row) per person, in which drinking values were separate variables (columns), one for each day from day −15 (about 2 weeks prior to LMP) to day 310 (the maximum possible pregnancy length). We used the number of days between a fixed date before the start of the study (Saturday, January 1, 2000) and the beginning of the pregnancy record (i.e., day −15) to find the day of the week on which the record started. This was then used to temporally align each pair of subjects prior to computation of the cosine distances. For example, when computing the nearest neighbors of a participant p whose record started on a Wednesday, if we encountered another participant q whose record started on a Monday, we aligned day −15 of p (a Wednesday) with day −13 of q (another Wednesday) and ignored the first two days (days −15 and −14) of q and the last two days (days 309 and 310) of p. The rationale for this alignment is that drinking behavior often varies by day of the week [15]. We also Winsorized (capped) outlier drinking values at 3 SD (21 for the South Africa site and 28 for the Northern Plains sites) to reduce their impact on the imputed values. Because the drinking pattern of subjects with data missing for a large number of days cannot be established, we excluded subjects with more than 200 missing days and subjects without any data in the first trimester.
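The day-of-week alignment and the capping step can be illustrated as follows. This is a sketch under stated assumptions: the helper names are hypothetical, the start dates are invented, and the general rule (always advance q and trim the end of p) is extrapolated from the single Wednesday/Monday example in the text.

```python
from datetime import date

REFERENCE_SATURDAY = date(2000, 1, 1)  # fixed reference date given in the text

def days_past_saturday(d):
    """0 for Saturday, 1 for Sunday, ..., 6 for Friday."""
    return (d - REFERENCE_SATURDAY).days % 7

def align(p, q, p_start, q_start):
    """Trim p and q so that position i falls on the same weekday in both.

    p_start and q_start are the calendar dates of day -15 of each record.
    """
    shift = (days_past_saturday(p_start) - days_past_saturday(q_start)) % 7
    if shift == 0:
        return list(p), list(q)
    # Drop the first `shift` days of q and the last `shift` days of p.
    return list(p[:-shift]), list(q[shift:])

def winsorize(values, cap):
    """Cap each observed value at `cap`; missing values (None) stay missing."""
    return [v if v is None else min(v, cap) for v in values]

# Day labels stand in for drinking values so the alignment is easy to read.
day_labels = list(range(-15, 311))
p_aligned, q_aligned = align(
    day_labels, day_labels,
    p_start=date(2008, 1, 2),  # hypothetical Wednesday
    q_start=date(2008, 1, 7),  # hypothetical Monday
)
capped = winsorize([0, 5, 30, None], cap=21)
```

In this usage, day −15 of p lines up with day −13 of q, and days 309 and 310 of p are dropped, matching the example in the text.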
Assessment of performance/validation
We validated our approach by comparing actual values from a subset of subjects with no missing data in trimester 1 with imputed values obtained after random deletion of data for 5 to 15 consecutive days. The first trimester was selected for validation because the proportion of women drinking and the magnitude of their drinking are highest in trimester 1, particularly on the days before pregnancy recognition. To evaluate imputation performance, we computed the root mean squared error (RMSE) of the predicted drinking values in the deleted segments as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
where $n$ is the length of a segment and, for $1 \le i \le n$, $y_i$ and $\hat{y}_i$ are the actual and predicted values, respectively, of the $i$th entry of the segment. We then calculated the overall prediction accuracy of drinking status as the proportion of days classified correctly and displayed the results in a confusion matrix (Fig. 5, panel b). In addition, we calculated absolute differences between actual and predicted values for drinking and non-drinking days separately.
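The validation procedure can be sketched as below, under several assumptions: the helper names are hypothetical, a single segment is deleted per call, and the cutoff that turns a predicted value into a drinking/non-drinking status (0.5 drinks here) is an illustrative choice that the text does not specify.

```python
import math
import random

def rmse(actual, predicted):
    """Root mean squared error over one deleted segment."""
    n = len(actual)
    return math.sqrt(
        sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n
    )

def delete_segment(series, rng, min_len=5, max_len=15):
    """Blank out one random run of 5 to 15 consecutive days.

    Returns the masked copy, the start index and the segment length.
    """
    length = rng.randint(min_len, max_len)         # inclusive bounds
    start = rng.randint(0, len(series) - length)   # segment must fit
    masked = list(series)
    masked[start:start + length] = [None] * length
    return masked, start, length

def status_accuracy(actual, predicted, threshold=0.5):
    """Share of days whose drinking status is predicted correctly.

    `threshold` is an assumed cutoff for calling a predicted day a drinking day.
    """
    hits = sum((y > 0) == (y_hat >= threshold)
               for y, y_hat in zip(actual, predicted))
    return hits / len(actual)

rng = random.Random(0)        # seeded for reproducibility
first_trimester = [0] * 98    # hypothetical complete record
masked, start, length = delete_segment(first_trimester, rng)
```

After imputing the blanked segment, `rmse` and `status_accuracy` would be applied to the actual and imputed values for days `start` through `start + length - 1`.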