H3K9me3 ChIP-seq of human PBMCs
Deidentified PBMC samples from 18-week-old children (n = 15) and mothers (n = 14) enrolled in the PROVIDE study (n = 700) [42] were obtained in collaboration with icddr,b (International Centre for Diarrhoeal Disease Research, Bangladesh) in Dhaka, Bangladesh. The number of samples was established based on previous work [35, 36], and samples were selected to represent the PROVIDE spectrum of HAZ and ΔHAZ scores, as well as to be balanced in sex representation (Additional file 1: Fig. S1,2). ChIP-seq was performed as previously described in detail in [35]. Briefly, PBMC samples were fixed with formaldehyde, chromatin isolated and sheared, 0.02% (microgram/microgram) Drosophila melanogaster chromatin (cat #53083, Active Motif, Carlsbad, CA, USA) was added for spike-in normalization [52]. H3K9me3 antibodies (12.5 µl per 100 µg chromatin protein solution, Abcam Cat# AB8898, RRID: AB_30684) were used for DNA fragment selection by immunoprecipitation. Sequencing libraries were constructed using the Illumina TruSeq ChIP Library Preparation Kit following the manufacturer instructions. Libraries were sequenced using an Illumina NextSeq500 instrument with high-capacity cartridge in the University of Virginia Genome Analysis and Technology Core, RRID:SCR_018883, yielding between 52 million to 142 million 150 bp single-end reads per sample (Additional file 1: Table S1).
H3K9me3 ChIP-seq data preprocessing
Bowtie2 (2.2.6) [53] with default settings was used to map raw FASTQ files to the hg19 reference genome obtained with refgenie (0.9.3) [54]. SAMtools (0.1.19-44428cd) [55] were used to convert alignment SAM files to BAM format (function: view -S -b), sort (function: sort), and index (function: index) BAM files, report number of mapped (function: view -c -F 4) and unmapped (function: view -c -f 4) reads and to filter out unmapped reads (function: view -h -F 4 -b). ENCODE blacklisted sites [56] were removed with bedtools (v2.26.0) [57] intersect function with argument -v. The quality of both raw FASTQ files and final BAM files was checked with FastQC (v0.11.5) [58] together with multiQC (1.11) [59]. To further visually inspect the quality of the data, BAM alignment files were converted to bigWig files (initially without normalizing) and browsed with Integrative Genomics Viewer (IGV_2.7.2) [60]. To generate bigWig files, BAM files were first converted to BED files with coverage information using bedtools bamtobed, these files were then converted to COV files using bedtools genomecov, and finally bigWig files were generated from the COV files with UCSC bedGraphToBigWig (v4) [61]. Chromosome sizes for bedtools genomecov and bedGraphToBigWig were obtained with refgenie (0.9.3) [54].
Regions of enrichment, i.e., peaks, were generated from BED coverage files using SICER (v1.1) [62] against input (separate input files for children and mothers) with redundancy threshold = 1, window size = 1000, fragment size = 425, effective genome fraction = 0.74, gap size = 5000, and FDR = 0.01.
Count tables with the number of mapped reads for each identified region were created for children and mothers separately. Identified peaks were first merged (mothers and children separately) with bedtools merge and quantification was done with bedtools multicov using BAM alignment files and merged regions.
Code is available from: https://github.com/AubleLab/H3K9me3_stunting.
Normalization
Drosophila spike-in reads were used for normalization to account for possible global changes in H3K9me3 levels (Fig, S3a). To retrieve the number of spike-in reads within each sample, raw FASTQ files were mapped to the Drosophila dm6 reference genome (obtained with refgenie) using bowtie2 with default parameters. SAMtools (0.1.19-44428cd) [55] were used to convert alignment reads to BAM format (function: view -S -b), sort (function: sort), and index (function: index) BAM files, ENCODE blacklisted sites [56] were removed with bedtools (v2.26.0) [57] intersect function with argument -v, and finally SAMtools were used to report number of mapped reads (function: view -c -F 4). The final normalization factors were then determined by dividing the number of reads that mapped to dm6 by an arbitrary constant (500,000). These normalization factors were then used in differential peak analysis with DESeq2 (1.30.1) [63]. Inverted values of the normalization factors (1/NF) were used to generate normalized bigWig files, specifically bedtools genomecov function with argument -scale set to the inverted normalization factor was used to create normalized bedGraph files from BAM alignment files, which were further converted to normalized bigWig files with UCSC wigToBigWig tool (v4) [61].
Correlations between the percentage of reads mapped to the dm6 reference genome (indicative of globally misregulated H3K9me3 levels, see Additional file 1: Fig. S3d) with all available biomarkers/measurements were calculated using R cor function with method = “spearman”.
The relationship between the percentage of reads mapped to dm6 and a child’s ΔHAZ (1 year) score was determined by fitting a linear model with the R lm function, where the p-value < 0.05 was considered significant.
Differential analysis
DESeq2 (1.30.1) [63] was used to identify differentially regulated H3K9me3 regions in both child and maternal datasets. Raw count tables along with drosophila normalization factors were passed to DESeq2 with a formula set to child’s sex + child’s ΔHAZ (1 year) score. Regions with FDR corrected p-values (padj) less than 0.05 were identified as significantly different. Other combinations of formulas combining a child’s sex with other available biomarkers were tested (data not shown), however, no significant regions were identified.
PCA
PCA was performed using prcomp R function applied to spike-in- and regularized logarithm- (DESeq2 (1.30.1) [63] rlog function) normalized count tables. Counts from all available regions were used for PCA. Results were plotted using ggplot2 (3.3.6) [64] and points colored based on child’s ΔHAZ (1 year) score with a palette from viridis package (0.6.2) [65]. PCA performed on both children’s and maternal data together was applied on spike-in-normalized counts over merged (bedtools merge function) children’s and maternal H3K9me3 regions. Correlations between PC1 with all available biomarkers/measurements were calculated using R cor function with method = “spearman”.
Average profiles
Average spike-in-normalized H3K9me3 profiles were obtained by first using the deepTools (3.3.1) [66] computeMatrix function to quantify spike-in-normalized bigWig files over H3K9me3 regions identified in children’s datasets. The computeMatrix parameters were set to scale-regions mode with -a 2000, and -b 2000 parameters to expand average profiles by additional ± 2 kb surrounding the H3K9me3 regions. From computeMatrix output, the values for average profiles were obtained with the deepTools plotProfile function with the outFileNameData parameter included, to get the actual average profile values in addition to the figure output. The resulting average profile values were then plotted using ggplot2 (3.3.6) [64], where smoothing of the profiles was done with geom_smooth function with parameters set to method = "loess", span = 0.2 and colored based on child’s ΔHAZ (1 year) score using viridis package (0.6.2) [65].
Average profiles using read count per million mapped reads normalization were obtained by mapping BAM alignment files onto the H3K9me3 regions with ngs.plot (2.61) [67] with default settings, which automatically returns average profiles normalized to read count per million mapped reads. These profiles were then replotted with ggplot2 to include coloring based on a child’s ΔHAZ (1 year) score with viridis package.
Correlations between the centers of average profiles with all available biomarkers/measurements were calculated using the R cor function with method = “spearman”.
Functional annotation of children’s H3K9me3 regions
The top 10% of H3K9me3 regions (based on FDR-corrected p-value, followed by p-value) that were identified in children’s datasets as differentially misregulated versus child’s ΔHAZ (1 year) score were compared with all of the H3K9me3 regions identified in the children’s datasets using GenomicDistributions (1.5.5) [68]. CalcFeatureDistRefTSS function with default parameters was used to compare the distances of the regions to the nearest TSSs (transcription start sites), and the distances were plotted with plotFeatureDist function with parameters featureName="TSS", size = 1000000, infBins = T, nbins = 300. CalcPartitions function was used to identify overlaps with genomic classes. The genomic classes used as an input to the function were the default classes provided by the GenomicDistributions package with PBMC enhancer regions added from EnhancerAtlas 2 [69]. To create a list of PBMC enhancer regions, enhancer regions from relevant cell types were downloaded from EnhancerAtlas 2, namely CD4+, CD8+, CD14+, CD19+, CD20+, GM10847, GM12878, GM12891, GM12892, GM18505, GM18526, GM18951, GM19099, GM19193, GM19238, GM19239, GM19240, and PBMC cells. Finally, the regions were merged using the bedtools (v2.26.0) [57] merge function. The parameters to the calcPartitions function were set to bpProportion = T, and the overlaps were plotted with plotPartitions function with default parameters. The cell-type specificity of the regions was determined by using calcSummarySignal function with provided H3K9me3 cell-type specific signal matrix (see “Normalized cell-type specific H3K9me3 signal matrix” section of Methods) and plotted with plotSummarySignal function with parameters plotType = "violinPlot", metadata = cellTypeMetadata, colorColumn = "tissueOfOrigin". A modified version of calcCumulativePartitions function was used to calculate cumulative distribution of the overlaps with genomic classes. The original function sorts regions based on their size; the modified function accepts presorted regions. The genomic classes used were identical with the ones used in calcPartitions function. The log2(fold-change) of cumulative distribution values of intergenic / intronic regions were calculated and smoothed for plotting using R loess function with parameter span = 0.15.
Histone mark and DBF (DNA-binding factor) enrichment of the top 10% significantly misregulated regions in children was calculated using LOLA (1.20.0) [70] against the CISTROME [71, 72] database of histone marks and transcription factors (here referred to as DBFs) filtered for blood cell types. The userUniverse parameter in LOLA was set to all defined children’s H3K9me3 regions, otherwise all default parameters were used. Significant enrichments were defined as those with q-values < 0.05. The most significant cell-type/antibody combinations were plotted in form of a dot plot, where dot sizes and dot colors reflect the significance of the enrichment for a given cell-type/antibody combination. Note that CISTROME provides only regions for the hg38 genome assembly, therefore all H3K9me3 regions used here were first converted to hg38 using UCSC liftOver tool [61].
Gene set enrichment of the top 10% significantly misregulated regions in children was done with GREAT [73]. Results were then reported as bar plots. Due to a larger number of identified GO:BP (Gene Ontology: Biological Process) terms, those were not reported in the bar plot and instead were first processed with REVIGO [74] with default parameters. The resulting network was further reorganized in Cytoscape (3.8.0) [75] and terms were manually classified into groups.
Normalized cell-type specific H3K9me3 signal matrix
A matrix containing normalized H3K9me3 values across different cell types was used to infer cell-type specificity of regions of interest with GenomicDistributions (1.5.5) [68] calcSummarySignal function. To create the H3K9me3 cell-specific matrix, ENCODE [76] data was used. First, BAM alignment files from all available H3K9me3 ChIP-seq experiments on primary cells were used (ENCODE filters: Assay title – Histone ChIP-seq / Target of assay – H3K9me3 / Organism – Homo sapiens / Biosample classification – primary cell / Genome assembly – hg19 / Available file types – BAM). The downloaded BAM files were first sorted and indexed with SAMtools (0.1.19-44428cd) [55] sort and index functions, respectively. To define H3K9me3 regions across genome, BED files were also obtained from ENCODE (ENCODE filters: Assay title – Histone ChIP-seq / Target of assay – H3K9me3 / Organism – Homo sapiens / Biosample classification – primary cell / Genome assembly – hg19 / Available file types – BED narrowPeak). BED files were sorted (sort -k1,1 -k2,2n) and merged with bedtools (v2.26.0) [57] merge function. Raw signal values were quantified by using bedtools multicov function on H3K9me3 processed BAM files with merged H3K9me3 BED file. The resulting raw count tables were further processed in R. First, samples from the same cell-type origin were merged by summing the cell-type replicate signal values, followed by normalization of the signal values for individual cell types. The normalization was performed in three steps: i) signal values above the 99th percentile for a given cell type were set to 1; ii) signal values bellow the 99th percentile for a given cell type were normalized to range between 0 and 1 using the following formula: \({n}_{i}=\frac{{x}_{i}-min\left(x\right)}{max\left(x\right)-min\left(x\right)}\), where ni is the normalized signal value for region i, xi is the raw signal value for a region i, and x is a vector of raw signal values for a given cell type; and iii) signal values ranging between 0 and 1 within a given cell type were further quantile-normalized with normalize.quantiles function from preprocessCore package (1.52.1) [77].
Gene coverage heatmap
Heatmaps with spike-in-normalized coverage over genes were generated using deepTools (3.3.1) [66] computeMatrix function with scale-regions mode, and parameters --sortRegions descend, -b 2000, -a 2000. Mapped were spike-in-normalized bigWig files onto a BED file containing hg19 coordinates of protein-coding genes along with information about strandness. The gene coordinates were obtained with EnsDb.Hsapiens.v75 (2.99.0) [78], ensembldb (2.14.1) [79], and AnnotationFilter (1.14.0) [80] packages, where ensembldb genes function was applied to EnsDb.Hsapiens.v75 object with AnnotationFilter set up to ~ gene_biotype == "protein_coding". Only standard chromosomes were kept using GenomeInfoDb (1.26.7) [81] keepStandardChromosomes function. The output from the computeMatrix function was then plotted as a heatmap with deepTools plotHeatmap function with parameters -- kmeans 2 --whatToShow 'heatmap and colorbar'.
Genes reported in cluster 1 were then passed to g:Profiler [82] for gene set enrichment analysis. Only KEGG, Reactome, WikiPathways and GO:BP terms were searched, as suggested by Reimand et al [83]. Significant terms (padj < 0.05) were plotted as bar plots showing -log10(padj) values, except for GO:BP terms (due to a larger number of identified GO:BP terms), which were first processed by REVIGO [74] with default parameters, resulting network was further reorganized in Cytoscape (3.8.0) [75] and terms were manually classified into groups.
Analysis of transposable elements
A library of TEs (transposable elements) was obtained from RepeatMasker [84] library 20140131 for the hg19 genome assembly. Genomic coordinates for individual TE classes were extracted into BED files using R. Unambiguous classes such as RNA, DNA, or classes with a question mark in them were removed from further analysis. Overlaps with H3K9me3 regions were calculated using GenomicDistributions (1.5.5) [68] calcExpectedPartitions function with arguments bpProportion = T, and genomeSize set to sum of hg19 chromosome sizes, as instructed in the GenomicDistributions manual. Overlaps were then plotted with the plotExpectedPartitions function with added ggplot2 layer to show color intensity based on the overall size of overlap in number of base pairs.
To functionally annotate interesting TEs, intersections of all ERVs, LINE L1, SVA retrotransposons, and telomeric satellite DNAs with the top 10% significant, as well as all defined H3K9me3 regions, were obtained with the bedtools (v2.26.0) [57] intersect function. Intersections with the top 10% H3K9me3 regions were passed to GREAT [73], where intersections with all H3K9me3 regions were used as background. Maximum top 20 GO:BP terms, as displayed by GREAT, were then plotted with ggplot2 (3.3.6) [64] as bar plots showing negative log10 of FDR-corrected p-values.
Gene targets with significantly increased H3K27ac in 18-week-old children were obtained from Kupkova et al, 2018 [36] and compared to the gene targets of all ERVs within the top 10% significant H3K9me3 regions annotated by GREAT. Results were plotted as an upset plot using the upset function from the UpSetR package (1.4.0) [85]. Gene targets shared between ERVs and H3K27ac regions were visualized and functionally annotated with STRING [86].
Relationships between H3K9me3 in mothers and children
PCA applied to both sets of data was performed as described in the PCA section of Methods.
Results from differential analysis (negative log2(fold-change) per unit change of child’s ΔHAZ score for both children and mothers) were plotted as box plots using ggplot2 (3.3.6) [64] and two-sided one-sample t-tests were used to determine any global unidirectional changes (H0: µlog2FC = 0, H1: µlog2FC ≠ 0, α = 0.05). To show direct comparisons of log2(fold-changes) between children and mothers, H3K9me3 regions from mothers were overlapped with H3K9me3 regions from children with the bedtools (v2.26.0) [57] intersect function with parameter -wao, which returns the original coordinates for overlapping regions. Log2(fold-changes) from the overlapping regions were then plotted as a scatter plot with ggplot2 and the Pearson’s correlation coefficient was calculated with the R cor function.
Correlation analysis between maternal and children’s samples was performed by first merging all maternal and child regions with bedtools merge followed by quantification of all maternal and child samples with bedtools multicov. A correlation matrix was then created by passing the count table to the R cor function with method = “pearson” and plotted as a heatmap using the Heatmap function from ComplexHeatmap (2.6.2) [87], where both rows and columns were clustered using average linkage.
To identify regions specific to mothers of stunted children / mothers of healthy children / stunted children / healthy children, all regions from all child and maternal datasets were first merged with the bedtools merge function to create a list of “master regions”. With the same function, regions of individuals belonging to a given category were merged, creating “category regions”. To compare the regions shared among or specific to categories, overlaps were found between “master regions” and individual “category regions” using bedtools intersect with the -wao parameter, which reports the original entries rather than the intersections. “Master regions” that were identified as overlapping with a given category were then passed to the upset function from UpSetR package (1.4.0) [85] to reveal the number of regions shared between a given combination of categories. Regions “shared in stunting” were defined as those “master regions” that had overlaps found with regions from mothers of stunted children and with regions from stunted children, but no other category, and analogically were defined regions “shared in health”. Summarization of the regions shared in health and shared in control individuals was performed using GenomicDistributions (1.5.5) [68], specifically, the calcFeatureDistRefTSS function was used to compare the distances to the nearest TSS. The log10-transformed distances were then plotted with ggplot2 as density plots, as these provided an easier visual comparison for a smaller number of regions compared to the default histograms offered by GenomicDistributions. Cell-type specificity inference and distribution across genomic classes was calculated as described in the “Functional annotation of children’s H3K9me3 regions” section of Methods.
Gene set enrichment analysis of regions shared in stunting and regions shared in health was done by first assigning the regions to genes with GREAT [73] followed by passing the genes to g:Profiler [82] for identification of enriched biological terms (sources of biological terms were filtered to GO:BP, GO:MF (Gene Ontology: Molecular Function), KEGG, WikiPathways, Reactome). Significant terms were those with adjusted p-value < 0.05. Results with < 10 significantly enriched terms were plotted as bar plots showing -log10(padj) values. Larger sets of terms were passed as GEM files produced by g:Profiler to the EnrichmentMap app [88] in Cytoscape (3.8.0) [75], where terms were organized into networks followed by manual organization of related terms into clusters.
Integration with H3K4me3 and H3K27ac
Differential analysis results for H3K4me3 and H3K27ac data from 18-week-old and one-year-old children were obtained from Uchiyama et al [35] and Kupkova et al [36] respectively. These results show association of the H3K4me3 and H3K27ac peaks with a child’s ΔHAZ (18wk) score for 18-week-old-children and ΔHAZ (1 year) score for one-year-old children. Significantly affected regions were considered those with padj < 0.05. Regions were extracted from the result tables and converted to BED files. Overlaps with H3K9me3 regions from this study were found with the bedtools (v2.26.0) [57] intersect function with parameter -wao (to report original regions instead of just intersections). Proportions of overlaps were calculated as the ratio between an overlap and the size of a given region.
Gene set enrichment analysis of overlaps was done as described in the “Relationships between H3K9me3 in mothers and children” section above and histone mark / DBF enrichment analysis was performed as described in the “Functional annotation of children’s H3K9me3 regions” section above. The userUniverse in LOLA was set to overlaps between H3K9me3 and all upregulated H3K4me3 regions in one-year-old stunted children for H3K4me3 analysis, and to overlaps between H3K9me3 and all downregulated H3K27ac regions in one-year-old stunted children for H3K27ac analysis. For H3K27ac, only highly significant enrichments (padj < 0.001) were reported due to the large number of enriched histone marks and DBFs.
Additional tools used
Tidyverse package (1.3.1) [89] was used to process data in R. Figures were assembled with Inkscape (1.0.2) [90], Illustrations were created with BioRender [91], and icons were obtained from the Noun Project [92].