Study samples
The Early Autism Risk Longitudinal Investigation (EARLI) and Markers of Autism Risk Learning Early Signs (MARBLES) studies are enriched risk prospective pregnancy cohorts studying autism etiology.(51, 52) The EARLI study was reviewed and approved by Human Subjects Institutional Review Boards (IRBs) from each of the four study sites (Johns Hopkins University, Drexel University, University of California Davis, and Kaiser Permanente Northern California). The MARBLES protocol was reviewed and approved by the Human Subjects IRB from University of California Davis. Secondary data analysis for this manuscript was approved by the Human Subjects IRB for the University of Michigan. Both studies recruited mothers of children with clinically confirmed ASD who were early in a subsequent pregnancy or were trying to become pregnant. In EARLI there were 232 mothers with a subsequent sibling born through this study between November 2009 and March 2012. In MARBLES there were 389 enrolled mothers that gave birth to 425 subsequent siblings between December 1, 2006 and July 1, 2016.
Covariate and exposure assessment
Demographics, behaviors, and medical history were all collected via maternal self-report questionnaire. In these questionnaires, mothers were asked if they used prenatal vitamins for each month of pregnancy (yes/no). Data for the first month of pregnancy was collected at study enrollment.
Sample Collection and Processing
In EARLI, biospecimens including cord blood and placenta, were collected and archived for 213 births. Full thickness placental tissue from a central cotyledon was collected. Sterile punch biopsy forceps were used to extract placental samples from the maternal and fetal sides. Whole cord blood was also collected at delivery. Samples were transported to the Johns Hopkins Biological Repository (JHBR) for aliquoting and archiving (-80ºC). Placental DNA was extracted with the DNeasy Tissue Kit (Qiagen), and cord blood DNA was extracted using the DNA Midi kit (Qiagen, Valencia, CA). DNA was quantified using the Nanodrop (ThermoFisher Waltham, MA) and normalized DNA aliquots were sent to the Center for Inherited Disease Research (Johns Hopkins University). DNA samples were bisulfite treated and cleaned using the EZ DNA methylation gold kit (Zymo Research, Irvine, CA) according to manufacturer’s instructions. DNA was plated randomly and was assayed on the Infinium HumanMethylation450 BeadChip (Illumina, San Diego, CA).(53) Methylation control gradients and between-plate repeated tissue controls were used.
In MARBLES, placental tissues and cord blood were collected at delivery and immediately processed and frozen. The MARBLES study used orientation to the umbilical cord to ensure that all placenta samples were isolated from the chorionic villus from the fetal side of the placenta. Placental and cord blood samples were stored at -80ºC in the UC Davis repository. Cord blood and placenta samples were processed for methylation measures. Placenta DNA was extracted with Gentra Puregene kit (Qiagen) and cord blood DNA was extracted using the DNA Midi kit (Qiagen, Valencia, CA). Samples were bisulfite treated and cleaned using the EZ DNA methylation gold kit (Zymo Research, Irvine, CA). DNA was plated randomly and assayed on the Infinium HumanMethylationEPIC BeadChip (Illumina, San Diego, CA) at the Johns Hopkins SNP Center, a shared lab and informatics operation with the Center for Inherited Disease Research (Johns Hopkins University). DNA methylation control gradients and between-plate repeated tissue controls were used.
DNA Methylation Processing
For all methylation samples, we used the minfi library (version 1.30.0) in R (version 3.6) to process raw Illumina image files into noob background corrected methylation values.(54, 55) In EARLI, cord blood and placenta samples were run on the 450k array together, and thus preprocessed together. Samples from multiple births (cord blood n = 2 samples, placenta n = 6 samples), as well as samples with discordant DNA methylation predicted sex and observed infant sex were removed (cord blood n = 3, placenta n = 1). Probes with failed detection P-value (> 0.01) in > 5% of samples were removed (n = 661), as were probes documented as cross reactive (n = 29,153).(56) Y-chromosome probes (n = 48) were dropped from analysis. There were 170 EARLI cord blood samples, and 127 EARLI placenta samples with 455,650 probes that passed DNA methylation quality control.
In MARBLES, placenta and cord blood samples were run on the EPIC array at different times and preprocessed separately. First, we dropped cord blood samples from multiple births (cord blood n = 8 samples). Samples that had mismatched predicted sex were dropped (cord n = 3). For siblings not from multiple births, all but one sibling was dropped (cord n = 13). Probes were dropped if they had detection-p (p > 0.01) failure in greater than 5% of samples (n = 4,630). Cross reactive probes (n = 42,967) and Y chromosome probes were dropped from analysis (n = 379).(57) There were 243 MARBLES cord blood samples with 817,883 probes that passed DNA methylation quality control. Second, no placenta samples had mismatched predicted sex. There were no samples from multiple births, and all but one sample from siblings were dropped (placenta n = 2). Probes that failed detection-p in > 5% of samples (n = 1,699), cross reactive probes (n = 43,068), and remaining Y-chromosome probes were dropped from analysis (n = 84). There were 90 MARBLES placenta samples with 821,008 probes that passed quality control. Sample exclusion is summarized in Supplemental Fig. 1 and CpG probe exclusion is summarized in Supplemental Fig. 2.
In cord blood samples, cell type (CD8+ T-cell, CD4+ T-cell, natural killer cell, B-cell, monocyte, granulocyte, and nucleated red blood cell) proportions were estimated using a combined reference panel with the IDOL method.(58) In placenta samples, EpiDISH(59) was used to predict proportions of placenta cell types using a reference panel from the planet package: trophoblasts, stromal cells, Hofbauer cells, endothelial cells, nucleated red blood cells, and syncytiotrophoblasts.(60) Mean DNA methylation per person was calculated as the mean across all probes.(19) Mean DNA methylation restricted to probes in genomic regions (CpG island, shore, shelf, or open sea) were also computed. We used Illumina’s annotation of CpG sites to assign genomic regions (CpG island, CpG shore, CpG shelf, open sea).(61, 62)
Genetics Data Processing
In EARLI, genetic data were measured using the Omni5 + exome array (Illumina) at the John Hopkins University Center of Inherited Disease Research (CIDR). Data on 4.6 million single nucleotide polymorphisms (SNPs) were generated for 841 EARLI family biosamples (including maternal, paternal, proband, and infant samples) from 254 families and 18 HapMap control samples. Samples were processed together, but only data from infants with cord blood or placenta methylation were used. No samples had missing genotypes at > 3% of probes, or excess heterozygosity or homozygosity (4 standard deviations). Probes were removed if they had technical problems flagged by CIDR or missing genomic location information. Single nucleotide polymorphisms (SNPs) with minor allele frequencies > 5% were removed if they had a missingness rate > 5%, and SNPs with minor allele frequency < 5% were removed if they had a missingness rate > 1%. There were 2.5 million clean SNPs for 827 samples, which were merged with the 1000 genomes project (1000GP, version 5) data(63) and principal components for genetic ancestry were computed.
In MARBLES, SNPs on 643 infant and mother samples from 234 families were genotyped using the Illumina Mega array at the John Hopkins University Center of Inherited Disease Research (CIDR). Maternal and infant samples were processed together, but only data from infants with cord blood or placenta methylation measures were used. We again applied stringent quality control criteria(64) to the raw 1.75 million genotypes to remove low quality SNPs and samples. Our criteria include removal of samples with call rate < 98%, sex discrepancy, and relatedness (pi-hat < 0.18) to non-familial samples. We also filtered SNPs with call rates < 95%, excess hetero- or homozygosity, and minor allele frequency (MAF) < 5%. After quality control, 620 samples and 758 thousand SNPs remained. Principal components were calculated on genotype data, and these principal components were used to adjust for genetic ancestry in models.
Statistical Analyses
Study sample descriptive statistics were calculated for each of the four cohort/tissue groups. For continuous covariates (maternal age at delivery, gestational age, estimated cell proportions), we calculated mean and standard deviation. For categorical covariates (maternal education, infant sex, infant race/ethnicity), we provided number and frequency. We calculated the bivariate relationships between covariates and prenatal vitamin use. For continuous covariates, we used t-tests and for categorical covariates, we used chi-square tests.
In multivariable linear regression analyses, first, we examined array wide mean DNA methylation differences by prenatal vitamin intake in the first month of pregnancy. Regression models were adjusted for infant sex, maternal age, gestational age, maternal education, genetic ancestry principal components, and estimated cell proportions. Since cell composition estimates sum up to 100%, to avoid collinearity issues in models, we did not use all predicted cell types in models. For placenta, syncytiotrophoblast and Hofbauer proportions were used, while in cord blood granulocyte and nucleated red blood cell proportions were used. We visualized regression coefficients and 95% confidence intervals using forest plots.
Next, we performed epigenome-wide association analyses by examining single CpG site differential DNA methylation. We fit parallel linear models for each probe. Models were again adjusted for infant sex, maternal age, gestational age, maternal education, genetic ancestry PCs, and estimated cell proportions. Regression and empirical Bayes standard error moderation were performed using the limma package.(65) We visualized findings using volcano plots of effect estimates and -log10(p-values). For sites reaching a nominal p-value threshold (p < 0.05), we calculated the proportion of sites that had higher DNA methylation with prenatal vitamin intake and the proportion of sites with lower DNA methylation. To compare pairwise results across cohort and tissues, we examined Pearson correlations of effect estimates from these regression models, across all sites in common between the 450k and EPIC methylation arrays. We also focused on CpG sites that had p-value < 0.01 in multiple cohort/tissues, examining the overlap of such sites with an upset plot and the Pearson correlation of overlapping sites. For sites prioritized in multiple cohorts/tissues, we also used scatter plots to visualize the effect estimates.
We tested enrichment for gene ontology biological processes using the missMethyl package.(66) As input to missMethyl, CpG sites with p-value < 0.01 in the epigenome-wide regression models were used. We ranked the gene ontologies by significance, then computed a rank sum by adding the ranks across the four cohort/tissue groups. In addition, we tested for enrichment of chromatin state types using eFORGE 2.0.(67) The top 1000 CpG sites for each cohort/tissue analysis was input into the eFORGE site, with appropriate array platform chosen (450k for EARLI, EPIC for MARBLES), and Consolidated Roadmap Epigenomics – All 15-state marks and 1 kb window proximity options, with other options set at defaults.
Replication Testing
In EARLI, whole genome bisulfite sequencing (WGBS) data were available on 63 cord blood samples sequenced on the HiSeq X (51 overlapping with array methylation data). Sample processing and WGBS quality control and alignment for cord blood samples(68) and placenta samples(69, 70) have been previously discussed. In MARBLES, WGBS data were available for 91 placenta samples sequenced on HiSeq X Ten (89 overlapping with the array methylation data), 45 cord blood samples sequenced on HiSeq 4000 (30 overlapping with array methylation data), and 42 cord blood samples sequenced on the HiSeq X (35 overlapping with array methylation data).
Raw sequencing reads were preprocessed, mapped to human genome, and converted to CpG methylation count matrices with CpG_Me using default parameters(71–73). Reads were trimmed for adapters and methylation bias, aligned to the reference genome, and filtered for PCR duplicates. Methylation counts at all sites were extracted to Bismark cytosine methylation reports. The CpG_Me workflow incorporates Trim Galore, Bismark, Bowtie2, SAMtools, and MultiQC.(72, 74–77)
Differentially methylated regions (DMRs) were identified between prenatal vitamin intake during the first month of pregnancy with adjustment for sex and 10 permutation tests using DMRichR.(71) The DMR analysis utilizes a smoothing and weighting algorithm to weight regions with high coverage and low variation. Permutation testing was performed on pooled null distribution to generate empirical p-values as significant DMRs. The DMRichR pipeline utilized dmrseq and bsseq algorithms.(78, 79) To evaluate consistency with array results, we identified array probes within 5kb of the DMR and examined concordance in estimated directions of effect of those CpG probes and the DMR.