Integrated Transcriptome-Wide Proling And Protein Structure Analysis of Pathogenic Genes In Venous Thromboembolism

Background: Genetic factors are considered to determine the balance of the coagulation and anticoagulation processes, yet the genetic variants related to venous thromboembolism (VTE) remain unclear. This study aimed to investigate the potential molecular mechanisms and pathogenic mutations associated with VTE by determining VTE-related differentially expressed genes (DEGs) by transcriptome-wide proling and assaying protein structure in VTE. Methods: Two gene expression datasets, GSE48000 and GSE19151, were accessed from the Gene Expression Omnibus (GEO) database to obtain gene expression data associated with VTE. We identied the DEGs between VTE patients and healthy people using R and performed functional enrichment analysis, including Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis. Then, whole-exome sequencing (WES) was performed for 25 VTE patients and 17 normal cases, and the structural locations of pathogenic missense mutations were identied using pyMOL. Finally, DGIdb database was used to select candidate drugs for the treatment of VTE. Results: A total of 232 DEGs were identied from the GEO database. The signicant function of these DEGs was mostly involved in RNA catabolic process and ribosome pathway. Notably, the results of WES for DEGs and protein structure analysis showed that Histamine N-Methyltransferase (HNMT) (chr2: 138759649 C>T, rs11558538) may be a main predisposing factor for VTE. In addition, Amodiaquine, Harmaline, Aspirin, Metoprine, Dabigatran, and Diphenhydramine were screened for VTE therapy. Conclusion: The results showed that HNMT (chr2: 138759649 C>T, rs11558538) may be potential target for the diagnosis and treatment of VTE.


Introduction
Venous thromboembolism (VTE) is the third most common cardiovascular disease worldwide, which manifests as deep-vein thrombosis, pulmonary embolism, or both [1,2]. Various epidemiological studies have demonstrated that the incidence of VTE is characterized by a remarkable number of genetic and environmental factors. In early epidemiological studies, the highest incidence of VTE was in Africa, followed by Caucasians and was lower for Asian [3]. With increased awareness of the diagnosis and management of VTE, the incidence in Asian has increased in recent years [4].
While VTE is classi ed as a complex, multifactorial and polygenic disease, common mechanisms driven VTA have been con rmed, such as gene-gene and gene-environment interactions [5]. Stasis, vessel damage, and a hypercoagulable state are three widely accepted mechanisms related to the occurrence of VTE [6]. Genetic epidemiological studies have revealed that genetic conditions are signi cant risk factors for VTE, accounting for up to 50% of all VTE patients including anticoagulant protein de ciency trapping (de ciency of protein C, protein S, and antithrombin), factor V Leiden mutation (FVL) (c.1601G > A, p.R534Q), prothrombin G20210A mutation (FII G20210A), hyperhomocysteinemia, elevated coagulation factors VIII, IX, X, histidine rich glycoprotein, and ABO blood group [7,8]. However, only a few genetic factors have been considered and the distribution of FVL and FII G20210A mutation depends on ethnic group, race, or geographical region, suggesting that there is still an urgent need to identify VTE pathogenic genetic factors [9].
The original raw expression data at the probe level was downloaded as CEL les and pre-processed and normalized with RMA using the 'affy' package in R version 4.0.2, followed by converting the data into corresponding gene expression data based on the different platform speci cations [15]. The Pearson's correlation coe cient was determined to validate the intra-group data repeatability and heatmap generation was visualized based on the 'heatmap' R package [16]. Principal component analysis (PCA) was conducted to view the clustering trends according to sample-to-sample distances using the 'ggord' package in R [17].

Identi cation of DEGs
The 'limma' package in R program was applied to screen DEGs between VTE samples and normal samples [18]. A two-tailed t-test was performed to examine DEGs by log2 (Fold Change) > 1 or < − 1 and adjusted P value < 0.05. Genes satisfying these conditions were grouped separately as DEGs by volcano plot in R [19].

Gene ontology and KEGG pathway enrichment analysis
GO is used to describe the biological processes (BPs), molecular functions (MFs), and cellular components (CCs) of gene products in a hierarchical ontology [20]. Signaling pathway analysis was conducted to map DEGS to the Kyoto Encyclopedia of Gene and Genomes (KEGG), which is a pathway-related database for systematic and comprehensive analysis of gene functions [21]. GO and KEGG pathway enrichment analyses were performed using the 'cluster Pro ler' package in R version 4.0.2 and P values less than 0.05 were considered statistically signi cant [22]. A GO network was visualized using the Metascape database (http://metascape.org) to validate our results [23].
2.6 Whole-exome sequencing for DEGs DNA was extracted from each patient using a DNA extraction kit (Qiagen, Hilden, Germany) from whole blood. Library construction, WES, and data analysis were conducted by iGeneTech in Shanghai. Then, 200 ng of genomic DNA from each individual was sheared by Biorupter (Diagenode, Belgium) to acquire 150-200 bp fragments. The ends of DNA fragment were repaired and Illumina adapters were added (Fast Library Prep Kit, iGeneTech, Beijing, China). After sequencing libraries were constructed, whole exomes were captured using the AIExome Enrichment Kit V1 (iGeneTech, Beijing, China) and libraries sequenced on an Illumina NovaSeq 6000 (Illumina, San Diego, CA) Next-Generation sequencing platform in the 150 bp PE mode. Bioinformatics analysis was performed to analyze nonsynonymous mutations including single-nucleotide polymorphisms (SNPs) and insertion-deletions (INDELs) using GATK (Genome Analysis Toolkit). All allele frequency data for DEG mutations were compared with the 1000 Genomes Project (http://www.1000genomes.org) and Exome Aggregation Consortium ExAC (http://exac.broadinstitute.org).

Molecular Docking
Ligand docking of HNMT and drugs was performed with default parameters using AutoDock molecular docking software (version 4.2) and the coordinates and box size were nalized according to ligand location [25].

Statistical Analysis
DEGs were selected based on t test. The whole genome GO categories and pathogenic mutations for these DEGs were identi ed using Fisher's exact test. The signi cance level for statistical tests was set at less than 0.05 (P < 0.05).

Pearson's Correlation Testing and PCA
Pearson's correlation test showed strong correlations between VTE and control samples in the GSE48000 dataset (Fig. 1A). The PCA pro le for the GSE48000 data revealed that the distances between samples were small in the VTE groups and control groups, respectively (Fig. 1B). Pearson's correlation analysis also indicated strong correlations for the GSE19151 data among the samples in the VTE group and control group, respectively (Fig. 1C). The close distance in the dimension of PCA illustrated the acceptable data repeatability between samples in the VTE group and control group for the GSE19151 dataset (Fig. 1D).

Identi cation of DEGs in VTE
As shown in Fig. 2, a total of 232 genes were designated as DEGs in the VTE group when compared with the control group. The volcano plots in this gure present the DEGs with a cutoff criteria of having an adjusted P-value < 0.05 and |log 2 fold change|>1 in the GSE48000 and GSE19151 datasets ( Fig. 2A and   2B). As examples of these differences, the top 10 differentially expressed genes are reported in Table 1.

Enrichment of DEGs by GO and KEGG Analysis
Gene functional enrichment analysis was performed to analyze the biological connections of these DEGs. Results of Gene Ontology (GO) enrichment analysis revealed that RNA catabolic process, viral gene expression, SRP-dependent cotranslational protein targeting to membrane, and viral transcription were the main biological processes (BPs) (Fig. 3A), and structural constituent of ribosome, cytochrome-c oxidase activity, and heme-copper terminal oxidase activity were the most enriched categories of molecular functions for these DEGs (Fig. 3B). The variations in cell component (CC) of DEGs were enriched largely in ribosome and hemoglobin complex (Fig. 3C). KEGG pathway analysis indicated that these DEGs were mainly involved in particular pathways, such as the ribosome, Huntington disease, and oxidative phosphorylation (Fig. 3D). Metascape was used to visualize these gene enrichment analyses to verify our results from R (Fig. 3E). We found that these DEGs were enriched in amino acid de ciency, ribosomal complex, oxidative phosphorylation, rRNA transcript, and blood coagulation.

Identi cation of Probable Disease-causing DEGs by WES
WES revealed 48 mutations of DEGs in the VTE group. The mutation types and the log 2 fold change are shown in Table 2. Because nonsynonymous mutations are most likely to affect protein function, we focused on the four SNP variants corresponding with four DEGs in the VTE group. These were HNMT (ch2: 138759649 C > T, rs11558538, adjusted P-value = 1.2 × 10 − 9 ), POLL (chr10: 103340056 G > A, rs3730477, adjusted P-value = 5.12 × 10 − 4 ), ZNF292 (chr6: 87925827 A > G, rs9362415, adjusted P-value = 2.95 × 10 − 8 ), and DPCD (chr10: 103361088 C > T, rs7874, adjusted P-value = 4.36 × 10 − 5 ). The adjusted Pvalue of HNMT was the lowest in this study. Functional analysis showed that most disease-causing DEGs were involved in anemia, sickle cell, pulmonary thromboembolisms, heparin-induced thrombocytopenia, thrombophilia, and so on, as shown by Metascape functional analysis (Fig. 4). As summarized in Table 3, we found that HNMT was expressed in heparin-induced thrombocytopenia, dermatitis, and atopic cases, conditions that may have strong impacts on VTE.

Protein Structure and Characterization of Missense HNMT Mutations
The Thr105Ile (rs11558538) polymorphism in the HNMT gene (ch2: 138759649 C > T, rs11558538, adjusted P-value = 1.2 × 10 − 9 ) was the biggest difference identi ed in a gene, and should result in nonsense-mediated decay and loss function of this protein. The 3D location is shown in Fig. 5A and Fig. 5B. The variant was positioned in the α-helix, where its side chain hydroxyl formed two hydrogen bonds with a backbone oxygen after mutation, causing a marked decrease in the levels of both HNMT enzymatic activity and immunoreactive protein [26,27]. HNMT is an enzyme that has been implicated in neurotransmission by inactivating histamine in the central nervous system [28]. However, histamine increases vascular permeability through the histamine H1 receptor to activate nerve endings, relaxing vascular smooth muscle [29].

Molecular Docking
The drug-target interactions for HNMT were predicted using DGIdb, and the results are presented in Table 4, providing a theoretical therapeutic mechanism for VTE prevention. Six drugs targeting HNMT have been predicted for VTE, including Amodiaquine, Harmaline, Aspirin, Metoprine, Dabigatran, and Diphenhydramine.
Molecular docking analysis was attempted to assess the potential noncovalent binding of HNMT with these small molecules drugs. In general, a lower binding energy indicated a stronger binding between HNMT and a compound. Table 4 shows the six drugs that best interfaced with HNMT. To visualize these docking results, the 3D interaction diagrams of HNMT and their corresponding best-matched drugs were drawn, as shown in Fig. 6. These drugs, such as Aspirin and Dabigatran, have been utilized to recanalize vessels and prevent thrombi growth clinically in VTE patients [30][31][32]. The 3D interaction diagram of Aspirin at the active site of HNMT revealed that this interaction was stable through forming hydrogen bonds with the key residues Lys55 and Lys135 (Fig. 6B). Aspirin is commonly administered to inhibit platelet aggregation and prevent thrombus formation [33]. Additionally, three hydrogen bonds formed with residues Phe9, Tyr15, and Ser91 contributed to stabilizing the interaction between Dabigatran and HNMT. Dabigatran has been approved for use in orthopedic surgery, venous thromboprophylaxis, acute VTE treatment, and extended prevention of recurrent VTE [34]. Our data had shown that HNMT can potentially become a new target for VTE treatment.

Discussion
In the present study, transcriptomics and proteomics technology were used to explore the potential pathways and pathogenic mutations of VTE occurrence.
High-throughput pharmacology and molecular docking may allow for the investigation of novel biomarkers for detecting this complex diseases [35].
We rst studied VTE by downloading transcriptome-wide expression data from the GEO database and a total of 232 DEGs were identi ed. Results of GO analysis of the gene enrichment in these datasets showed that the VTE-associated DEGs were signi cantly enriched in RNA catabolic process. Wang, HX found previously that RPL9, RPL35, and RPS7 were hub genes in the PPI network of GSE13985, which was used to identify potential markers of atherosclerosis development [36]. Interestingly, a study from Mi, YH reported that the major risk factors for atherothrombotic disease were also signi cantly associated with VTE, which contributes to the explanation of why atherosclerosis is an independent risk factor for VTE [37]. KEGG pathway analysis revealed that the DEGs related to VTE were mainly enriched in the ribosome pathway. Recent evidence has suggested that the ribosome affects the translation of platelets, platelet aggregation, and resultant thrombus formation [38,39]. It may be reasonable for us to then hypothesize that ribosomal proteins might have crucial functions in VTE development. However, there is no direct evidence that RNA catabolic processes and the ribosome pathway are directly involved in VTE.
To verify the above results, whole-exome sequencing (WES) was performed to detect the pathogenic mutations of DEGs. POLL encodes the novel DNA polymerase lambda and the mutation of POLL (rs3730477) encoded R438W Pol λ leading to genomic instability and mutagenesis in cells [40]. DPCD (rs7874) is named from an uncharacterized genomic region surrounding POLL. DPCD is a novel gene in primary ciliary dyskinesia and severe cases can induce pulmonary embolism [41]. ZNF292 (rs9362415) acts as a transcription factor and plays an important role in DNA recognition and apoptosis regulation [42]. However, little is known about the role of DNA related functions in VTE. Notably, we discovered that HNMT (rs11558538) polymorphism was the greatest differentially expressed factor in this study. As is well-known, HNMT is implicated in neurotransmission by inactivating histamine, and histamine has been argued to relax vascular smooth muscle [29]. From a protein structure analysis, we found that the Thr105Ile mutation results in hydrogen bonds in the structure of HNMT being disrupted, resulting in loss-of-function mutations [43]. The 3D structure diagram of HNMT showed quite different in protein conformations between the Thr105 and Ile105 variants. Furthermore, a list of drugs targeting HNMT with potential therapeutic e cacy against VTE were selected, most notably Aspirin and Dabigatran. As a consequence, we inferred that the high expression of mutated HNMT acted on vascular smooth muscle and may further promote vasoconstriction and thrombosis through RNA catabolic process and the ribosome. However, the mechanisms of these genes and drugs in VTE are still unclear. In future work, we hope to verify our conclusions experimentally to elucidate the effects of Thr105Ile (rs11558538) in HNMT for VTE.

Conclusions
The current study was designed to investigate potential DEGs and genetic variant of DEGs in VTE (Fig. 7). RNA catabolic process and ribosomes pathway identi ed through integrated bioinformatic analysis of GSE48000 and GSE19151 datasets may play crucial roles in the development of VTE. Additionally, Thr105Ile (rs11558538) polymorphism in HNMT was identi ed as a risk factor for VTE in the mechanism of damage and dysfunctional to the vascular endothelial cell and vascular smooth muscle. In the future, more in-depth investigation about the mechanism of these candidate genes is warranted for VTE.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.