A compilation of CHD-associated single nucleotide polymorphisms and their annotations
To identify CHD-associated SNPs, we retrieved data from single nucleotide polymorphism (SNP) resources based on Genome-wide Association Studies (GWAS), including NHGRI-EBI GWAS-catalog (19) and ClinVar (20, 21) databases. This systematic approach resulted in a cumulative set of 15,876 CHD-SNPs (computational pipeline illustrated in Fig. 1A). As of August, 2023, the GWAS-Catalog contains a total of 5,39,949 SNPs while ClinVar database consists of a total of 4,609,367 variations including SNPs and other types of genomic variations. In GWAS-catalog, out of 5,39,949 SNPs, 1,884 unique statistically significant (p-value < 10e− 6) SNPs were found to be associated with CHD-related traits (4, 22–25) (CHD-SNPs; Table S1 and S2). Consequently, ClinVar consisted of 4,609,367 variations, out of which we retrieved 15,248 unique CHD-specific variations which consisted of 13,992 CHD-SNPs and other types of variations, including 629 deletions, 287 duplications, 107 indels, 43 insertions and 190 microsatellites (Table S3). Further, ANNOVAR (26) annotation of CHD-related variants revealed that, from GWAS-catalog, 33.55% and 43.63% lies within the intergenic and intronic regions of the genome while only 4.46% of the variants were from the exonic part of the genome (Fig. 1B). On contrary, ClinVar database consisted of a large percentage (83.54%) of CHD-SNPs within the exonic part of the genome while only 0.01% and 7.03% of the CHD-SNPs were present on intergenic and intronic regions, respectively (Fig. 1C). These differences exist mainly due to the fact that more than 80% of ClinVar associations result from whole exome sequencing studies (WES), thus predominantly capturing the protein-coding part of the genome. Additionally, ClinVar primarily deposits fully curated and clinically tested genetic variations.
Based on their location, CHD-SNPs were classified into: (i) coding, indicative of sequences that transcribe into proteins, and (ii) noncoding, denoting sequences void of protein-coding potential. Noncoding regions specifically encompass distal intergenic, intronic, and promoter regions, the latter of which is demarcated by the stretch of region from 1kb upstream to 1kb downstream of the transcription start site (TSS). In total, out of 1,884 CHD-SNPs extracted from the GWAS-catalog, 163 and 1,721 were located in coding and noncoding regions of the genome, respectively. The latter consisted of 632 in distal intergenic, 822 in intronic, 216 in ncRNA-intronic and 51 in promoter regions. While in the ClinVar database, out of 13,392 CHD-SNPs, we obtained 12,171 within coding and 1,221 within noncoding genomic regions, with the latter consisting of 2 in distal intergenic, 984 in intronic, 214 in ncRNA-intronic and 21 in promoter regions.
Identification of a potential set of human cardiac enhancers
We then sought to ascertain the identity of the noncoding CHD-SNP containing regions as enhancers. Enhancer regions are enriched by specific histone modification marks which could also indicate their activity (27–30). Among these histone modification marks, H3K4me1 is associated with the presence of enhancers, while the presence of H3K27ac is a signature of active enhancers, differentiating them from their poised counterparts (53). Additionally, H3K4me3 predominantly marks promoter sequences (31, 32). We hypothesize that noncoding CHD-SNPs might disrupt enhancer regions and consequently, TF binding. To test this, we first sought to assess the region carrying the CHD-SNP as functional enhancers. For this purpose, each noncoding CHD-SNP was expanded by 75 bp on both flanks, yielding a 150 bp genomic segment with the SNP centrally located. We then probed these segments for the presence of enhancer-specific histone modification marks. Utilizing epigenomic datasets (33) from human heart organogenesis across nine Carnegie stages (CS13, CS14, CS16-21 and CS23), we sought regions marked by the histone modification marks H3K4me1, H3K27ac, and H3K4me3 (pvalueSignal ≥ 9.0). Such elements exhibiting significant enrichment for these marks were designated as putative cardiac enhancers. Together, we obtained a set of 2,056 CHD-SNP-containing putative cardiac enhancers (CHD-enhancers) (Table S4), which consisted of 1,139 from GWAS-catalog and 917 from the ClinVar database. The analysis of CHD-enhancers distribution across Carnegie stages, illustrated in Fig. 2A, unveils the dynamic landscape of CHD-enhancers shaping the embryonic development of the heart. Notably, the highest count of 1,138 enhancers were identified in the later developmental stage (CS23) while in the initial stage (CS13), 821 enhancers were predicted. The lowest number of enhancers, 97, were found in CS17 stage. Moreover, the co-occurrence of the predicted enhancers was also observed, as shown in the intersection set in Fig. 2A, where for instance, out of the 821 identified enhancers in CS13, 530 were also found to be expressed in other stages, while 291 were specific to CS13. Furthermore, the occurrence of each predicted CHD-enhancer in a specific developmental stage, its presence or absence in other stages could potentially reveal intricate patterns of regulatory activity throughout embryonic heart development (Fig. 2B).
Subsequently, to corroborate our findings, we cross-referenced CHD-enhancers with human enhancer disease database (HEDD) (41). This intersection resulted in the identification of 801 enhancers that overlapped with CHD-enhancers (Table S5). Further, we categorized CHD-enhancers based on their macs2pval signal across all Carnegie stages. This resulted in 55 early, 85 intermediate, 205 late, 2 early-late and 17 always-active putative cardiac enhancers (Fig. 2C; Table S6).
Enrichment of transcription factor binding motifs in putative cardiac enhancer regions
Enhancers regulate gene activity by binding to tissue-specific TFs at specific sites known as transcription factor binding sites or motifs (TFBSs) (54, 55). Disease-associated noncoding variants within enhancer regions often perturb the TFBSs, resulting in disrupted transcriptional output of their target genes (56). To prioritize CHD-enhancers based on their potential regulatory function and study CHD-specific regulatory networks, CHD-enhancer regions were analyzed for the presence of TFBSs. From the motif enrichment analysis, we identified 163 statistically significant TFBSs (threshold on q-value < 0.05) (Table S7). Notably, many of these identified TFs play specific roles in early cardiac development, valvulogenesis, and cardiac maturation processes (Table 1). Among the TFBSs identified was that of Wilms' tumor-1 (Wt1) (q-value = 0.00023), known for its expression in cardiac endothelial cells during both normal heart development and post-infarction periods (57). Prior studies have elucidated its vital role in heart vessel formation (57). Another TF identified in this analysis was GATA Binding Protein 4 (GATA4) which is known to play a pivotal role in cardiac progenitor cells specification and cardiac septation (58). Additionally, our analysis revealed a significant over-representation of TFBSs from the Kruppel-like factor (KLF) family, recognized for their specialized roles in heart development (59). These included KLF2 (60, 61), KLF4 (62, 63), and KLF15 (64), the latter shown to be overexpressed during episodes of heart pressure overload, typically seen in hypertrophic cardiomyopathies. It also regulates GATA4 expression to stabilize cardiomyocyte size (64). Our analysis also identified the TFBS for Spleen focus forming virus proviral integration oncogene (SPI1), known for its heightened expression post-myocardial infarction (65). From the TEAD TF family, TEAD1 was enriched, documented for its expression in adult mammalian hearts and association with normal heart contractility where its mutations can lead to dilated cardiomyopathy (66).
Other significant TFBSs enriched among the CHD-enhancers include Estrogen Related Receptor Alpha (ESRRA), known for its pivotal role in postnatal cardiac maturation through mitochondria-focused biogenesis in cardiomyocytes (67); PLAG1 Like Zinc Finger 1 (PLAGL1), important for sarcomere development and contractility in cardiomyocytes (68, 69); and TGFB induced factor homeobox 1 (TGIF1), with an established role in a cardiac neural crest pathway crucial for the proper septation of the outflow tract (70). Together, the presence of these motifs and their linked TFs in our potential CHD-enhancers highlights their possible role in early heart development and maturation.
Table 1
A list of significantly enriched Transcription Factor Binding Sites (TFBSs) identified within CHD-enhancers through motif enrichment analysis. The "Motif" column represents the DNA sequence of each binding site; "TF" specifies the transcription factor, and "q-value" indicates the statistical significance level of the enriched motif.
Motif
|
Sequence
|
q-value
|
Logo
|
Wt1
|
chr12:2493497–2493647
|
0.000236
|
|
KLF4
|
chr12:124016103–124016253
|
0.00254
|
|
KLF15
|
chr12:2550685–2550835
|
0.00277
|
|
TEAD1
|
chr10:103539779–103539929
|
0.00397
|
|
ESR2
|
chr7:44542312–44542462
|
0.00529
|
|
Plagl1
|
chr8:124847533–124847683
|
0.0125
|
|
NKX2-2
|
chr11:2570545–2570695
|
0.0288
|
|
GATA4
|
chr7:22728430–22728580
|
0.0405
|
|
Putative cardiac enhancers and CHD-SNPs showed high conservancy across 99 vertebrate species
Although enhancers are known to be resistant to a certain degree of sequence mutations (71), it is noteworthy to highlight that enhancers with pivotal roles in developmental processes exhibit high conservation across vertebrates (38, 72). We therefore sought to employ evolutionary conservation analysis of CHD-enhancers and their CHD-SNPs across 99 distantly related vertebrate species (39). We identified 131 conserved enhancers (131/2056, 6.37%) out of total 2,056 putative enhancers, with significantly high mean \(phastCons score\ge 0.6\)(38). Among these, 33 (33/2056, 1.60%) revealed ultra-high sequence conservation \(phastCons score\ge 0.8\) (40) (Table S8). To validate the significance of the evolutionary conservation observed in putative cardiac enhancers, we compared their conservation scores with randomly generated genomic sequences. This comparison allowed us to determine whether the observed conservation in our enhancers is indeed enhancer specific or merely reflective of broader genomic patterns. From this comparison, we obtained a mean conservation score of \(0.2143\) (\(SD= 0.2243\)) within the identified set of putative cardiac enhancers. In contrast, conservation analysis of random sequences revealed low mean conservation value \(0.08264\) (\(SD= 0.1501\)). Statistically significant differences were obtained between the conservation scores of compared groups with \(p value<{2.2e}^{-16}\)from wilcoxon rank sum test. The higher conservation scores for putative cardiac enhancers in comparison to random genomic elements suggest their vital evolutionary function (Fig. 2D) and highlight their functional relevance in cardiac development. Furthermore, determining per-nucleotide conservation of CHD-SNP locations revealed 48 conserved variants with \(phyloP score\ge 6.0\) contained across 99 vertebrate species (Table S9).
To further determine the potential functional impact of CHD-SNPs on their target gene expression, we investigated the presence of regulatory SNPs (rSNPs) with known expression quantitative trait loci (eQTLs obtained from GTEx database (34)) in CHD-enhancers. We found 63 CHD-SNPs with known eQTLs which were distributed across various cardiac tissues, including 16 in the aorta, 9 in the coronary artery, 23 in the heart atrial appendage, and 15 in the left ventricle. These variants were designated as regulatory CHD-SNPs (rCHD-SNPs; Table S10). Notably, among rCHD-SNPs, an intronic variant rs12724121 (chr1:236688982–236688983) was situated within a conserved CHD-enhancer element (chr1:236688907–236689057). This region aligns with DNaseI open chromatin regions from fetal heart tissue and contained multiple TFBSs, including those specific to cardiac function (Fig. 3). The presence of rs12724121 in this evolutionarily conserved context suggests its potential regulatory role in cardiac-related processes. Moreover, the convergence of conservation, enhancer-specific histone modification marks, and TFBS density highlights the intricate regulatory landscape surrounding this particular CHD-SNP.
CHD-SNPs within the coding regions can disrupt the protein-protein binding efficiency
SNPs within protein-coding genes can potentially disrupt protein-protein interactions, leading to abnormal cellular signaling and developmental defects that may contribute to CHD pathogenesis (17, 48, 49). To examine this, a protein-protein interaction network (PPIN) was constructed from the set of genes encompassing 11,770 coding CHD-SNPs. PPIN (Figure S1) consists of 306 proteins (286 genes from our query and 20 additional genes resulting from GeneMania-Cytoscape module output (42, 43)) with 675 edges (connecting physically interacting proteins). The clustering analysis of the PPIN identified 16 functional modules. These clusters ranged in size, with the largest cluster, denoted as cluster 1 (Figure S2) encompassing 72 proteins and 304 edges. The smallest cluster consisted of only 3 nodes and 2 edges. Next, to study the impact of SNPs on the binding affinity of the physically interacting proteins, we selected two cardiac sarcomeric proteins from cluster-1 namely, cardiac myosin binding protein-C (MYBPC3) and cardiac α-actin (ACTC1) (16) (Fig. 4A, B). Prior studies revealed a substantial number of individuals with hypertrophic cardiomyopathy (HCM) and dilated cardiomyopathy exhibit alterations in sarcomeric proteins (17, 18). Therefore, to analyze the effect of CHD-SNPs on sarcomeric proteins, we obtained an experimentally resolved protein complex for MYBPC3-ACTC1 (18) from protein data bank (PDB). Subsequently, the structural analysis of this protein complex (using PDBePISA (73)) revealed 15 interface residues and the remaining 331 residues were located on its surface. Changes in binding energy, a measure to assess protein complex stability (\(\varDelta \varDelta G\)) was computed for 15 distinct mutations (exonic CHD-SNPs) and assessed their impact on protein binding stability using MutaBind2 (52) (Table 2). From this analysis, we found a specific CHD-SNP (rs770030288) which is located in C2 domain of MYBPC3 protein with arginine to histidine substitution at 419 position has deleterious effect, which may impair its interaction with ACTC1 protein. Notably, the evolutionary analysis of MYBPC3 protein sequence using ConSurf (47) also detects high conservancy of arginine, highlighting the functional importance of this residue in preserving the conformational stability of the protein (Fig. 4C). Moreover, upon inducing the R419H mutation, histidine (mutated residue) interacts with only 70 other atoms in vicinity (Fig. 4D (b, d)) while the non-mutated form arginine interacts with 152 atoms, suggesting structural instability upon mutation (Fig. 4D (a, c)).
Table 2
Mutabind2 analysis of CHD-SNPs impact on Protein-Protein binding affinity. The columns “Protein 1 chain” and “Protein 2 chain” represents the protein 1 and protein 2 chains in a complex, “Mutated Chain” column represents the mutated protein chain, “Mutation” specifies the mutation, “DDG” quantifies the variation in free energy upon mutation and its impact on protein complex stability. A mutation is classified as deleterious if DDG + 1.5 or − 1.5 kcal mol− 1. The column “Interface” indicates if the mutation occurs at the protein-protein interface or not.
Protein 1 chain
|
Protein 2 chain
|
Mutated Chain
|
CHD-SNP
|
DDG
|
Interface
|
Deleterious
|
A
|
G
|
G
|
R419H
|
1.64
|
no
|
yes
|
A
|
G
|
G
|
K380R
|
0.44
|
yes
|
no
|
A
|
G
|
G
|
E451Q
|
0.03
|
no
|
no
|
A
|
G
|
G
|
G416S
|
1.35
|
no
|
no
|
A
|
G
|
G
|
E441K
|
1.23
|
no
|
no
|
A
|
G
|
G
|
F448S
|
0.29
|
no
|
no
|
A
|
G
|
G
|
A364T
|
0.41
|
no
|
no
|
A
|
G
|
G
|
A429V
|
-0.02
|
no
|
no
|
A
|
G
|
G
|
K380N
|
1.08
|
yes
|
no
|
A
|
G
|
G
|
K380R
|
0.44
|
yes
|
no
|
A
|
G
|
A
|
N92S
|
0.3
|
no
|
no
|
A
|
G
|
A
|
G245D
|
1.27
|
no
|
no
|
A
|
G
|
A
|
I267T
|
0.6
|
no
|
no
|
A
|
G
|
A
|
F21L
|
0.44
|
yes
|
no
|
A
|
G
|
A
|
L8M
|
0.19
|
no
|
no
|