Characterization of the semen methylome
Our study aims to explore the relationship between DNA methylation of semen and factors that drive changes in methylation including semen cell type composition and chronological age. The study workflow is illustrated in Fig. 1A. We collected semen and buccal swab samples from 83 male patients. The seminal fluid was characterized based on sperm concentration, motility, and morphology. A majority of samples came from men between ages 30 and 45 years old (mean 38.8); individuals with abnormal semen parameters were evenly distributed across the sample age range (Fig. 1B). The body mass index and semen parameters for the study cohort are provided in Fig. 1C and Table 1.
Table 1
Age, body mass index, and semen analysis parameters for the study cohort.
| Mean ± S.D. | Range |
Age (y) | 38.7 ± 6.2 | 27.0–61.5 |
Body mass index (BMI) | 26.6 ± 6.0 | 18.7–61.6 |
Semen analysis |
Semen volume (mL) | 2.5 ± 1.3 | 0.1–7.5 |
Sperm concentration (M/mL) | 63.9 ± 57.0 | 0–216.1 |
Total sperm motility (%) | 46.3 ± 21.3 | 0–81 |
Strict morphology (%) | 20.1 ± 16.7 | 0–80.8 |
Semen pH | 8.2 ± 0.24 | 8.0–8.8 |
We measured the DNA methylation of each sample using targeted bisulfite sequencing. This approach has certain advantages over the widely used DNA methylation arrays, Reduced Representation Bisulfite Sequencing (RRBS) or Whole Genome Bisulfite Sequencing (WGBS), as it allows for the selection of regions of interest in the genome and the generation of high coverage data to obtain accurate estimates of DNA methylation at those sites. By contrast, arrays have fixed probes, and RRBS of WGBS tend to generate lower coverage datasets. The limitation of our approach is that the number of targets we select is limited compared to other methods, but our selection criteria allows us to enrich biologically interesting sites based on the mining of prior datasets. The probes we used in our targeted bisulfite sequencing assay were chosen from different sources: some were selected from the EWAS Atlas [33] and represent regions with significant EWAS hits across multiple studies; others were collected from epigenetic clocks, such as the Horvath [15], Hannum [16], GrimAge [34] and PhenoAge [35] clocks; and another set contained sites with cell type specific DNA methylation regions (see Supplementary File 1 with list of probes). Our targeted panel leads to the generation of a methylation matrix of 72000 CpG sites across 83 individuals, with a minimum coverage of 40 and average coverage of 74.
Cell type analysis of DNA methylation data
It is widely recognized that cell type heterogeneity can have a significant impact on DNA methylation levels. To account for cell type heterogeneity in seminal fluid we considered four major cell types, including sperm cells, prostate epithelial cells, lymphoid cells and an unknown fraction of cell types that we suspect to have characteristics of myeloid cells. For the buccal swab specimens, we included both epithelial and immune cells. We obtained the whole-genome bisulfite profiles of these cell types from [28] and [29]. We conducted cell type deconvolution for the semen samples using a non-negative regression approach (Fig. 2). Not surprisingly, the deconvolution results suggest an inverse relationship between sperm cell abundance and the other cell types. Additionally, we identified two azoospermia samples that had no sperm cell content but displayed a significant proportion of an unknown cell type with myeloid characteristics. These individuals have an impairment of sperm production with unidentified etiology.
Identification of factors that influence DNA methylation in semen
To jointly characterize multiple factors that influence DNA methylation in seminal fluid we used multivariate multiple regression. Seminal fluid DNA methylation was used to train a multifactor model incorporating a constant term, a batch effect term, age, sperm cell composition, and prostate epithelial cell composition. The lymphoid cell composition and other cell type composition were highly correlated among themselves and with the sperm cell composition, hence these were removed to include only the two aforementioned cell types. We solve the model using the pseudoinverse method. Specifically, we first estimated the coefficient matrix as \(\beta = {{X}_{}}^{†}{\cdot M}_{}^{obs}\), where \({{X}_{}}^{†}\) is the Moore-Penrose pseudoinverse of the multifactor phenotype matrix \({X}_{ }.\) Once the coefficient matrix was obtained, we computed the pseudoinverse again to predict the multifactor phenotype matrix as \({X}_{}^{pred}{={{M}_{}^{obs}\cdot \beta }_{}}^{†}\), and predict the methylation matrix as \({M}_{}^{pred}=X\cdot {\beta }_{}\). We correlated the predicted values of factors with their actual values and found that all three factors included in the model have statistically significant correlations between their predicted and actual values (p < 0.05 and R ≥ 0.4) (Fig. 3).
Analysis of factor associated methylation sites
In order to identify the DNA methylation sites that are most associated with the factors in our model, three filters were defined. First, we filtered for sites whose methylation level was highly correlated with the predicted methylation level \({M}_{}^{pred}\) generated from the multifactor model. Next, we selected methylation sites from the multivariate multiple linear regression model that had statistically significant coefficients. Third, we retained only methylation sites that had the highest absolute correlation coefficient among all factors from the coefficient matrix \(\beta\). Finally, if there were more than 200 sites that met these criteria we chose the top 200 sites with the lowest adjusted p-values. To further analyze the factor-specific CpG sites, we divided the sites into positive and negative groups based on the sign of the coefficient in the multifactor model. Further details of the site selection procedure are described in the Methods section.
When we analyzed the selected sites associated with age in the seminal multifactor model, we found 83 methylation sites that had negative coefficients, which indicates the loss of methylation at these sites during aging. For sperm cell composition, we identified 48 sites with positive coefficients and the remaining 152 sites with negative coefficients. We analyzed the two groups of sites separately. For simplicity in naming, we will refer to the sites with negative coefficients for sperm cell composition as "negative sperm sites," and vice versa for the positive sites.
We used Cistrome to identify transcription factors associated with the significant sites. Cistrome identifies enriched transcription factor (TF) binding sites within a set of input regions, allowing identification of potential regulators of methylation sites. The genomic coordinates of the factor-associated sites were input as peak sets to the Cistrome Data Browser (Cistrome DB). ChIP seq datasets from Cistrome DB were overlapped with the peak sets we defined, and transcription factors with a significant binding overlap were returned. Functional enrichment analysis was also performed using Cistrome-GO, and the significant Gene Ontology (GO) terms from cellular components, molecular functions, and biological processes were identified.
For the positive sperm sites, the top three transcription factors were NELFA, BRD4, and TAL1 (Fig. 4A). The positive coefficient suggests that the genes associated with these sites are repressed with increasing sperm percentage. NELFA is part of the NELF protein complex and negatively regulates RNA polymerase II transcription elongation [36]. BRD4 is a kinase that phosphorylates RNA polymerase II and regulates transcription [37], and TAL1 is involved in multiple cellular processes including myeloid cell differentiation and positive regulation of cellular component organization [38]. The significant GO terms from molecular functions include cytokine activity, cytokine receptor binding, and chemokine activity. The significant GO terms from biological processes include immune response, defense response, and cytokine-mediated signaling pathway (Supplementary Table 1–2).
Among the negative sperm sites, the top three transcription factors were LRWD1, POLR3D, and PR (Fig. 4A). We expect that the expression of the genes bound by these factors increases with increasing sperm percentage in the seminal fluid. LRWD1 plays a major role in organization of heterochromatin structure in the somatic cells and is widely observed in human testis [39], which is also shown in Fig. 4B. POLR3D is involved in RNA polymerase III transcription processes, and is associated with cell growth and proliferation [40]. The progesterone receptor (PR) gene is reported to cause male infertility [41]. The significant GO terms from cellular components include chylomicron and very-low-density lipoprotein particles. The significant GO terms from molecular functions include adenosine deaminase activity, receptor serine/threonine kinase binding, and deaminase activity (Supplementary Table 3–5).
For the negative age sites, the top three transcription factors are associated with POLR2A, ZNF768, and NR3C1 (Fig. 4C). POLR2A is another RNA polymerase that is responsible for the transcription of a large fraction of protein-coding genes [42]. ZNF768 is a transcription factor [43] and the NR3C1 gene encodes glucocorticoid receptors and is found to regulate testicular functions [44]. The significant GO terms from cellular components include inflammasome complex, chylomicron, and cytoplasmic region. The significant GO terms from molecular functions include tau protein binding. The significant GO terms from biological processes include negative regulation of viral entry into host cell, regulation of viral entry into host cell, and cytokine-mediated signaling pathway (Supplementary Table 6–8).
We also examined the tissue level expression of gene proximal to the significant methylation sites using GTEx, a public database that allows us to query the gene expression level across different human tissues. We chose the top fifty genes associated with negative age sites, positive sperm sites, and negative sperm sites. The genes are ranked by adjusted RP score calculated on Cistrome-GO, which measures the proximity to promoters. We then identified their gene expression level across different tissues. Among the genes associated with negative age sites, we observed expression across a broad range of human tissues. In particular, the genes that encode for interferon-induced transmembrane protein 1 (IFITM1), IFITM2, and IFITM3 were found to be specifically expressed in whole blood and EBV transformed lymphocytes (Fig. 5A). In terms of the negative sperm sites, we noticed the high expression level of genes including testis specific serine kinase 6 (TSSK6) and adenosine deaminase domain containing 1 (ADAD1), which were most highly expressed in human testis (Fig. 5B).
Identification of factors that influence DNA methylation in buccal swabs
DNA methylation in human buccal swabs has been shown to be a strong predictor of chronological age [45]. In this study, we sought to compare age associated changes in buccal swabs and semen. We identified potential overlaps and differences between the significantly age associated methylation sites in buccal swabs and semen to characterize common and divergent mechanisms of aging. To accomplish this goal, we performed similar analysis for the buccal DNA methylation as for the semen methylation, which were collected from the same set of individuals. We developed a multifactor model for the buccal methylation sites. The method for constructing the buccal multifactor model was similar to that used for the semen multifactor model. The buccal multifactor model incorporates a constant term, a batch effect term, and two phenotypic factors: age and epithelial cell composition. Both phenotypic factors exhibited statistical significance (p < 0.05) and strong correlation coefficients (R ≥ 0.6) when regressing the predicted values against the actual values (Fig. 6).
We applied the same filtering procedure used in the semen multifactor model to select significant buccal methylation sites (see Method). We specifically analyzed the selected sites associated with age in the buccal multifactor model. We found 190 methylation sites associated with age with positive coefficients and 10 sites with negative coefficients. We analyzed the two groups of sites separately. For simplicity in naming, we will refer to the sites in the negative group associated with age as "negative age sites," and so on.
We used Cistrome to conduct the analysis of the significant sites. For the negative age sites, the top three transcription factors were MED1, SPI1, and EZH2 (Fig. 7A). Among other functions the mediator complex subunit 1 (MED1) is also involved in promoting oral mucosal wound healing and acts as a master regulator of epithelial cell fate [46, 47]. The SPI1 gene encodes transcription factors that activate gene expression during myeloid and B-lymphoid cell development [48], and EZH2 is crucial for the maintenance of epithelial cell barrier integrity and an active participant that shapes the aging epigenome [49, 50]. As for positive age sites, the top three transcription factors are associated with RNF2, JARID2, and REST (Fig. 7B). The RNF2 and JARID2 genes are involved in transcriptional repression of genes involved in development and cell proliferation [51, 52], and REST is a gene silencing transcription factor that is widely expressed during embryogenesis, and represses non neuronal lineage genes in tissues outside of the brain [53].
Histone modification analysis
We also examined histone modification patterns of negative age sites from seminal fluid and buccal swabs using Cistrome DB. H3K27me3 is ranked as the top histone modification for both seminal fluid and buccal swabs. Previous studies have shown that H3K27me3 is involved in silencing gene expression during embryonic stem cell differentiation [54, 55]. We observe a distinct pattern of modification between the seminal fluids and buccal swabs specifically on H3K36me3 (Fig. 8A; Fig. 8B). H3K36me3 is associated with regions that are transcribed, such as the gene bodies [56, 57]. By contrast, the buccal swab sites contain H3K4me3, suggesting they are promoter sites. Since they contain both H3K27me3 and H3K4me3, they are likely enriched for bivalent promoters that repress genes in stem cells that are activated in a tissue specific manner.