An integrated genome-wide model for hookah constituents-gene-pathway-disease relationships

Hookah smoking is becoming increasingly popular worldwide and recent epidemiological studies indicate that it is signicantly associated with respiratory diseases, and could contain constituents more harmful than cigarette smoking. However, the gene expression alterations and mechanisms of disease development induced by the hookah constituents are still unclear. Here, we are the rst to propose a new experimental design and use RNA sequencing to prole bronchial epithelial cells from different hookah smoking exposure conditions (GSE162386). For our hookah-smoking dataset, we developed an integrated genome-wide model, hierarchical-systems biology model (HiSBiM), to systematically investigate the effects of smoking hookah and to compare the biological pathways with public cigarette smoking datasets. Our results show that the infectious disease-related pathways were enriched in hookah smoking than in cigarette smoking, which could be due to polluted hookah devices or pathogen contaminated water lters. Furthermore, we investigate the subsystem and pathway differences in hookah smoking between charcoal and electronic heating ways. We observed that cardiovascular and atherosclerosis-related pathways were activated after exposure to charcoal combustion, which may relate to combustion components carbon monoxide and polyaromatic hydrocarbons. Overall, our results show that hookah smoking could involve a higher risk in several diseases than cigarette smoking and link the relations between hookah constituents to diseases. We believe that our HiSBiM is a useful integrated method for providing multiple-level valued insights for genome-wide analysis on various omics data sets.


Introduction
Hookah smoking is becoming increasingly popular worldwide. Recent epidemiological studies have indicated that hookah smoking is signi cantly associated with cancer, cardiovascular and lung diseases, and may contain more harmful constituents than cigarette smoking [1][2][3][4][5] . Besides, some reports have indicated that toxic substances of hookah smoking are generated from charcoal emissions by burning charcoal together with tobacco, which yielded about 15 folds carbon monoxide (CO) and polyaromatic hydrocarbons (PAHs) and 3 folds nicotine compared to cigarette smoking [6][7][8] . However, traditional epidemiology or biology alone are not su cient to elucidate the underlying mechanisms of hookah smoking and its constituents. Large omics data has led to potential use of systems biology approaches for a better understanding of biological mechanisms of hookah smoking. Currently, the omics-based insights exploring the relationships between hookah smoking, hookah constituents and cigarette smoking are still lacking.
To address these issues, we enhanced the omics data analysis strategy, Hierarchical-Systems Biology Model (HiSBiM) 9 , to explore the relationships between cigarette, hookah and hookah constituentsgene/protein-pathway-disease. In this study, the bronchial epithelial cell line (BEAS-2B) was used to expose ve different hookah conditions in 2 and 24 hours, such as charcoal with no shisha, charcoal with non-tobacco (herbal) shisha, and charcoal with tobacco shisha, and all of the samples were performed NGS to study the effects and mechanisms of hookah smoking and its constituents (e.g., CO, PAH). The HiSBiM showed that hookah smoking mainly induced cancer, bacterial or viral infections, and cardiovascular disease-related subsystems, as well as signi cantly changed cell cycle-related genes (e.g., CCND1) in viral infection pathway. We also observed that CO and PAH constituents in hookah smoking are closely related to cardiovascular diseases, such as atherosclerosis. In summary, we believe that our HiSBiM model was useful in genome-wide data analysis of hookah and cigarette smoking for elaborating disease mechanisms.

Materials And Methods
The overview of our method for investigating hookah smoking and its constituents are described in Fig. 1, the key steps summarized in Fig. 1A. To explore the effect of hookah smoking and its constituents, we designed cell-based experiments for hookah exposure in two heating ways and tobacco or herbal shisha, and all of the 30 samples were performed RNA sequencing to generate gene expression pro les and have been deposited to Gene Expression Omnibus 10 (GEO accession GSE162386) (Fig. 1B). Cigarette smoking datasets were collected from public omics databases (Fig. 1C). Then, we identi ed DEGs from our hookah-smoking gene expression pro les and public cigarette smoking datasets, and evaluated pathway enrichment based on the hypergeometric distribution (Figs. 1D and 1E). Next, the p-values were transformed into z-scores to quantify KEGG pathways (Fig. 1F). Finally, the meta-z-scores of subsystemand system-levels were calculated from pathway-level z-scores to quantify subsystems and systems (Fig. 1G).

Hookah exposure experimental design
To study the harmful effects of hookah smoking on cells, we designed cell-based experiments for hookah smoking exposure with various constituents and treatment times ( Fig. 1B; Table 1). The bronchial epithelial cells (BEAS-2B) were cultured and exposed with and without tobacco in two heating ways of hookah smoking, including internal control (IC), charcoal with no shisha (C), charcoal with non-tobacco (herbal) shisha (CNT), charcoal with tobacco shisha (CT), and electronic heater with tobacco shisha (ET) for 30 min while on a rocker. The exposure concentration was measured as the highest that caused no toxicity at 24 hours, as assayed by LDH release. For each exposure condition, three wells were treated with Trizol at 2 and 24 hours after exposure. RNA extraction was performed by the RNeasy Plus Mini Kit, Qiagen, according to the manufacturer's instructions. Table 1 No. of samples of our in-house datasets.

Groups
After exposure 2hr After exposure 24hr Control (IC) 3 3 Charcoal (C) 3 3 Charcoal + non-tobacco (CNT) 3 3 Charcoal + tobacco (CT) 3 3 Electronic + tobacco (ET) 3 3 Next generation sequencing All of the 30 samples were performed RNA-seq to analyze whole-genome transcriptome using Illumina HiSeq 2500 platform with 50 bp single-end reads (Illumina, USA). For each sample, the reads were aligned with the human reference genome (GRCh38) using the HISAT2 package 11   Hierarchical-systems biology model For omics data analysis, we rst identi ed the differentially expressed genes (DEGs) in each compared group with |log2FC| > 0.58 and p-value < 0.05 by using the limma package 15 (see Supplementary Table S1 online). Based on our identi ed DEGs, the hypergeometric distribution was used for pathway enrichment analysis and pathways with p-value < 0.05 were considered signi cant 16,17 (Fig. 1D). Here, the p-value is calculated as (1): where N is the number of genes recorded in KEGG, M is the number of DEGs, n is the number of genes in the speci c KEGG pathway, and i is the number of genes that are DEGs in the speci c KEGG pathway. The p-value of each pathway was transformed into the z-score using the standard normal distribution (Fig. 2). We then computed the subsystem-and system-level meta-z-scores for 45 KEGG subsystems and 6 systems (Figs. 1E and 1F), such as metabolism, genetic information processing, and human diseases, based on our previous study 9 . The meta z-score is de ned as (2): where z i is the z-score of the pathway i, and n is the total number of pathways in a subsystem or system.
The meta-z-score re ects the signi cance (enrichment) of a subsystem or system for a speci c condition or disorder, such as hookah smoking.

HiSBiM for analyzing the mechanisms of hookah smoking
We used HiSBiM to observe and compare omics of hookah and cigarette smoking via the hierarchical manner. There are two groups of hookah smoking, including 2 hours (H1) and 24 hours (H2) after exposure, and three cigarette smoking datasets (Cig1, 2, and 3). We observed that the two system-level displayed the differences in Human Diseases (mean meta-z-scores 12.4 and 5.3, respectively) and Environmental Information Processing (mean meta-z-scores 6.6 and 2.5) between hookah smoking and cigarette smoking (Fig. 3A).
In subsystem-level of Human Diseases, the top 5 subsystems of hookah compared to cigarette smoking were mainly related to Cancers: Speci c types (mean meta-z-scores 9.6 and 2.5), Infectious diseases: Viral (mean meta-z-scores 6.0 and 1.4), and Cardiovascular diseases (mean meta-z-scores 6.22 and 3.16) (Fig. 3B). This is in agreement with hookah smoking being indicated as a risk factor for cancers, respiratory infections and cardiovascular system 3 . Also, higher frequency of hookah smoking is associated with the risk increasing of esophageal cancer, and an average of six folds higher risk linked to lung cancer 1,2,4,5 . Interestingly, we observed that the pathway with the major difference of hookah versus cigarette smoking was Human papillomavirus infection (mean z-scores 3.13 and 0.24) belonging to subsystem of Infectious diseases: Viral (Fig. 3C). These phenomena might be caused by polluted hookah devices or microbe contaminated water lters. The similar observation of hookah and papillomavirus infection relationship had been reported by some epidemiological studies 18,19 , but still lack of omicsbased analysis for exploring the mechanisms. Among the DEGs obtained here, we observed that CCND1 (log2FC: 1.21 and 1.00), CCND3 (log2FC: 0.76 and 0.59), and LAMB3 (log2FC: 0.64 and 0.47) were highly expressed after hookah exposure for 2 and 24 hours, respectively (Fig. 3D). The genes CCND1 and CCND3 are cyclin D genes, coding for cell-cycle regulatory proteins involved in the control of the G1/S phase transition 20 . LAMB3 encodes the beta 3 subunit of laminin, reported to mediate the attachment, migration and organization of cells into tissues during embryonic development by interacting with other extracellular matrix components 21 . Also, in hookah exposed case of 24 hours, PIK3R1 was up-regulated (log2FC: 0.74) which encodes the PI3K regulatory subunit p85α involving in proliferation, apoptosis, and metabolism functions 22 . For linking the associations of genes-pathway-disease of hookah smoking, we mapped hookah smoking DEGs to Human papillomavirus infection pathway. The results showed that hookah smoking may aggravate pathogen infection by increasing cell cycle and proliferation (Fig. 3D).

Hookah constituents-gene-pathway-disease relationships
For investigating the relationship between hookah smoking, its constituents, and cigarette smoking, we clustered eight groups, including hookah, CO and PAH (CP1 and CP2), nicotine (N1 and N2), and cigarette (Cig1, Cig2 and Cig3), based on their corresponding subsystem-level meta-z-scores (Fig. 4A). The results showed that hookah and cigarette groups were clearly distinguished into two clusters, which represent a signi cant difference in the mechanisms of hookah and cigarettes smoking. The hookah, CO and PAH, and nicotine groups were further divided into two clusters: 1) Hookah and CO and PAH grouped in the same cluster; 2) Nicotine groups are an independent cluster.
By comparing the top 10 subsystems, we found that CO and PAH (CP) dataset shared 7 subsystems with hookah smoking dataset (Jaccard index: 0.54) (Fig. 4B). This means that hookah constituents CO and PAH may have a high correlation with the effect mechanisms of hookah smoking use. The CO and PAH mainly effected certain speci c subsystems, including Cardiovascular diseases (meta-z-score: 2.03), Endocrine system (meta-z-score: 1.96), and Amino acid metabolism (meta-z-score: 0.98). Furthermore, we found that Fluid shear stress and atherosclerosis with the highest z-score (12.17) in pathway-level of Cardiovascular diseases (Fig. 4C). We then mapped the DEGs of CO and PAH dataset into KEGG pathway and observed that hookah constituents could link to atherogenesis (Fig. 4D). The associations between PAH and atherogenesis also have been reported where the PAHs-exposed mice showed increased macrophage content in atherosclerotic plaques 23 . Here in this pathway, we nd that Transcription factor AP-1 (JUN) was activated (log2FC: 1.61) after 2 hours exposure to hookah smoking (i.e., CO and PAH) triggering promoter down-stream gene expression (e.g., chemokines). In long term effect of hookah smoking exposure, CCL2 was up-regulated (log2FC: 0.65) which induced pro-atherogenesis, in ammatory response and adhesion of leukocytes to the endothelium.
Currently, there is still a lack of systems biology approaches to underlying the relationship between hookah smoking, its constituents, and disease. In this current research study, we are the rst to propose a new experimental design and use RNA sequencing to pro le bronchial epithelial cell from hookah smoking generated with different heating ways and tobacco or herbal shisha, including 30 samples among ve conditions (GEO accession GSE162386). For our hookah-smoking dataset, we developed an integrated genome-wide model, called hierarchical-systems biology model (HiSBiM) with four-layered analysis, to systematically investigate the effects of smoking hookah and to compare the biological pathways with several public cigarette smoking datasets. Based on HiSBiM, we identi ed DEGs and pathways induced by hookah smoking and demonstrated their main differences in infection-related subsystems with cigarette smoking. These most likely is caused by polluted hookah devices or pathogen contaminated water lters. We also showed that cardiovascular disease and atherosclerosis were activated after acute exposure to hookah smoking by charcoal combustion. These results may relate to combustion components carbon monoxide and polyaromatic hydrocarbons compared to hookah smoking by electronic way. Overall, our results show that hookah smoking could involve a higher risk in several diseases than cigarette smoking and link the relations between hookah constituents to diseases.
We believe that our HiSBiM is a useful integrated method for providing multiple-level valued insights for genome-wide analysis on various omics data sets.