2.1. Study participants and study design
Tehran Cardiometabolic Genetic Study (TCGS) is a family-based genetic database according to a larger strudy, Tehran Lipid and Glucose Study (TLGS)(11–13). TLGS is a population-based cohort started in 1999, and study subjects have been followed up through six phases (every three years) on non-communicable disorders’ (NCDs) and related risk factors (14). The current study extracted 4,214 unrelated participants aged 18 or above, based on the well-designed protocol in previous GWAS in TCGS (9,10). HTN incident cases and a random sample of healthy individuals with two or more follow-up records were included in the study using a nested case-control sampling design.
2.2. Ethics and consent
This research was approved by the Medical Ethics Committee of Shahid Beheshti University of Medical Sciences and the National Committee for Ethics in Biomedical Research in May 2021 with code IR.SBMU.ENDOCRINE.REC.1400.024. All participants gave written informed consent to participate. The research has been performed in accordance with the Declaration of Helsinki.
2.3. Measurements
Principles for clinical investigations could be found in the original TCGS and TLGS papers (19,22). Having SBP ≥140 mmHg or DBP ≥90 mmHg or taking antihypertensive treatment was hypertension definition. Risk factors of high blood pressure traits (SBP, DBP, and HTN) were chosen based on previous GWAS literature(6,9,10,15). The following predictors were included in the model as covariates: sex, age, and body mass index (BMI). BMI was measured as the ratio of body weight to squared height (kg/m2). Using the Amelia package in R, height and weight were imputed multiple times using the Expectation-Maximization method with Bootstrapping (EMB) approach, for which the missing rate of covariate’s values ranged from 0.2–7.7% over five phases of TLGS. Traits and their risk factors were tested for their distribution among males and females, using the T-test and Chi-square test.
2.4. Genotyping and quality control
Blood samples of TCGS participants were genotyped using humanOmniExpress-24-v1 bead chips on 652919 single nucleotide polymorphisms with an average mean distance of 4 kilobases for each individual at deCODE genetic company (16). Genomic quality control (QC) was implemented in comprehensive steps(10,17).
2.5. Testing the integrated effect of gene
Sequence Kernel Association Test (SKAT) was used to test the overall effect of AGT, ZBED9, and TNXB. Optimal SNP-set kernel-based association test (SKAT-O) is used to test the association between a set of variants in a region in the genome, either rare or common or both cumulative effects, and a trait (binary or quantitative) by a linear or logistic regression framework(18). SKAT-O obtains an aggregated score of individual variants and efficiently obtains a corresponding SNP-set P-value. Moreover, the covariates are adjusted in SKAT-O. In the case of the linear kernel, test statistic can be expressed as a weighted sum of the single-marker score test statistic. SKAT-O is usually considered as a variance component test that assumes the genotyping effects are random. P-value is provided for each set of variants of ZBED9, AGT and TNXB, separately. All analyses are done using the SKAT R package, version 2.0.1 (19). Readers can find the workflow of gene-gene interactions between AGT, ZBED9, and TNXB involved in high blood pressure traits (SBP, DBP, and hypertension) in Supplementary Figure. 1
2.6. Making variant collection
After specifying genes, 65 variants, including three previously reported tag SNPs (ZBED9: rs450630, TNXB : rs2021783, AGT: rs2493134), were extracted in the TCGS database. Next, variants in the selected regions that were in low to moderate LD (r2≤0.4) with the tag SNPs were extracted(N=13). The reliefF algorithm, as a machine learning approach for classification and feature selection, was applied to extract the most significant variants to reduce the computational expenses (20). The final set consisted of 11 SNPs, including four variants located in TNXB and AGT and three in ZBED9. Supplementary Table. 1 showes the information of genetic variants in ZBED9, AGT, and TNXB and their LDs. Figure 1 depicts the three genes. As illustrated in the figure, the ZBED9 gene contains two polymorphisms in exons 3 and 4, and one polymorphism in intron region 2. One polymorphism is located in exon 21 and the other three are located in the introns of the TNXB gene. Additionally, the AGT gene contains a variant in exon 2 and others in the intronic regions, and the interaction of these polymorphisms results in a significant association with hypertension.
2.7. Single Locus Analysis
The generalized Linear Model (GLM) was employed in four different genetic models to determine the optimal genetic model for every single variant and extract the effects of causal variants. These models included additive, dominant, recessive, and co-dominant models. To investigate the effect size of rare variants (MAF < 0.05), Firth’s Bias-reduced logistic regression was used by the R logistf package (21). Coding labels for the additive model were defined based on the minor allele count, as BB=0, AB=1, and AA=2. Dominant, recessive and co-dominant models were defined as AA + AB versus BB, AA versus AB+BB and AB versus AA+BB genotypes, respectively.
2.8. Epistasis and Local Interactions
Model-based Multifactor Dimensionality reduction (MB-MDR) has been proposed by Calle et al. as a semiparametric machine learning dimension-reduction algorithm. MB-MDR incorporates three main steps, (i) constructing two-way contingency tables for the combination of SNPs (p-locus model) and creating a new categorical variable X. (ii) computing the wald statistic for the testing association of each X variable and the trait. (iii) conducting a permutation test to prevent inflating alpha error rate(Calle et al. 2008). In the first step, two-way contingency tables are constructed for each SNP combination (p-locus model). Each cell in the contingency table is considered as a “Low Risk”, “High-Risk”, or “No-Effect” category, based on the student’s t-test statistic, to test if the mean difference of a cell with the rest of the cells is either significant or non-significant. If the difference between cell means is significant at P-value=0.10, the cell will be labelled as either “Low Risk” or “High-Risk”, based on the t statistic’s sign; otherwise, it will be considered as “No-Effect”. If the sign of the t statistic is positive, then the corresponding cell will be labelled as “High-Risk”; otherwise, it is “Low Risk”. Thus, a new “categorical variable X= {“No effect”, “Low effect”, “High effect”}” will be created in the first step of Mb-MDR. The Wald test statistic is conducted between the categorical variable X and the trait in the second step. We considered the maximum of “wald-low effect” and “wald-high effect”, which are statistics for the mean differences of the “Low effect” and “High effect” category with two other categories, respectively. To avoid increasing type I error, a permutation test and bootstrapping as cross-validation tests are being assessed for evaluating the significance of the maximum of “wald-low effect” and “wald-high effect” for each p-locus model in the third step (Calle et al. 2008).
The current study employed this method to detect antagonistic and synergistic interactions within (local) and between loci of ZBED9 with AGT and TNXB. All interaction effects were calculated by the “mbmdr” package in R version 4.0.3 software, and P-values were adjusted using the Bonferroni correction method. Multiple logistic regression was employed to capture the corresponding significant genotypes for each interaction model, both synergistic and antagonistic interactions. To obtain an entropy-based interaction graph and visualize pairwise interactions between variants, mdr 3.0.2 software was used. To do so, independent effects of variants and their pairwise interaction effects were assessed and visualized using IGENT, entropy-based gene-gene interaction method for genome-wide interaction analysis. Fully description of the IGENT is described elsewhere (23).
2.9 Investigation of the missense variations affect the proteins
We applied the bioinformatics analysis to investigate the effects of the missense variation on the structure of the proteins coded by our interesting genes. The ZBED9 is an intrinsically disordered protein (IDP) [1, 2]. In the free and native conditions, IDPs have no single well-defined tertiary structure. Therefore, the bioinformatics tools were applied to identify disordered regions and disordered binding regions of this protein. IUPred2A [3, 4] is a valid and user-friendly webserver to Predict our desired information. The sequence of ZBED9 was obtained from NCBI with the protein Accession Q6R2W3.1. The 6i3f PDB file [5] of the RCSB database was used as the 3D structure of the AGT protein. The visualization of PDB files has been done by Chimera [6].