Identifying chemicals linked to lung cancer: Integrating genome-wide association studies with a chemical-gene interaction network

Lung cancer is the most common cancer and the leading cause of cancer-related mortality worldwide. Because environmental chemicals play a significant role in tumorigenesis, it is necessary to explore lung cancer-related chemicals and provide new clues for prevention and treatment. The genome-wide association study (GWAS) summary data of malignant neoplasm of bronchus and lung (C34) were downloaded from the UK Biobank and include 1,655 cases and 450,609 controls. DNA methylation profiles of non-small cell lung cancer were obtained from the GEO database (GSE75008). A transcriptome-wide association study (TWAS) was performed with FUSION software to detect genes significantly related to lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD). The Comparative Toxicogenomics Database (CTD) enables the construction of chemical-gene-disease networks. We conducted a chemical-related gene set enrichment analysis (GSEA) based on GWAS summary data, DNA methylation data, and the CTD to explore the relationships between chemicals and the two major histological subtypes of lung cancer. Venn analysis was used to identify the related chemicals common to the GWAS summary data and the DNA methylation data.


Conclusion
By integrating GWAS and chemical-gene interaction networks, we linked various chemicals to lung cancer on a genetic basis. Moreover, this study provides new clues for exploring the etiology of lung cancer and a new method for finding chemicals related to tumors or other complex diseases.

Background
Lung cancer is the most common cancer and the leading cause of cancer-related mortality worldwide, and it is the most frequent cause of cancer death in multiple regions of the world. Non-small cell lung cancer (NSCLC) accounts for about 85% of all lung malignancies (1,2). The occurrence and development of lung cancer is a complex process in which many factors play a role.
Cigarette smoking is the strongest risk factor for lung cancer development, accounting for 85% of all lung cancer cases in the United States (3). In addition, exposure to fine particulate matter (PM2.5) increases lung cancer morbidity and mortality. Studies have also reported that the high incidence of lung cancer among women in Xuan Wei County (Yunnan Province, China) was linked to bituminous coal combustion.
As these examples show, chemical exposures are multiple and mixed, which makes it difficult to accurately assess the association between individual exposure factors and lung cancer.
Owing to the growing availability of human genome data and developments in bioinformatics, the specific genes that confer inherited predisposition to lung cancer are gradually being revealed (7). GWAS can identify single-nucleotide polymorphisms (SNPs) in different individuals to see whether any variant is associated with a trait (8). SNPs associated with disease risk tend to lie in the non-protein-coding genome, which suggests that they regulate gene expression by altering the activity of non-coding regions. Expression quantitative trait loci (eQTLs) have been used to identify associations between risk genotypes and gene expression (9). Furthermore, transcriptome-wide association studies (TWAS) have been proposed as an approach that integrates eQTL analyses with GWAS in order to find genes that are associated with disease through the genetic regulation of their expression (9,10). This method has been used to detect risk genes for tumor initiation, development, and recurrence. For instance, Mancuso et al. used TWAS and identified new prostate cancer risk regions (11). In addition to genotype and gene expression associated with lung cancer, accumulating evidence has revealed a relationship between epigenetic alterations and the development of cancer. Kettunen et al. identified novel DNA methylation changes associated with lung cancer and asbestos exposure, which suggested a role for epigenome deregulation in the mechanism of carcinogen-induced malignancies (12).
Since tumorigenesis results from the interaction of genetic, epigenetic and environmental factors, it is appropriate to explore chemical-tumor relationships by jointly analyzing chemical-gene interaction data (13). By integrating these data to construct chemical-gene-disease networks, CTD, as both a database and a discovery tool, promotes further understanding of the effects of environmental chemicals on lung malignancy (14). In this study, we integrated GWAS, DNA methylation, and the CTD by extending the classic GSEA approach to explore the association between environmental chemicals and lung cancer.

GWAS summary dataset of lung cancer
The UK Biobank is a large, detailed prospective study with more than 500,000 participants, for whom abundant phenotypic and genotypic data have been collected (15,16). The large-scale GWAS dataset for lung cancer (C34), which includes 30,798,054 SNPs, was derived from the atlas of genetic associations in the UK Biobank (GeneATLAS, http://geneatlas.roslin.ed.ac.uk) (17). The sample contained 3067 individuals with malignant neoplasm of bronchus and lung who had been diagnosed previously or were first diagnosed between Jul 2012 and Mar 2019. We used the R tidyverse package for data import, tidying and manipulation.

Transcriptome-Wide Association Study
We used FUSION (Functional summary-based imputation) to process the GWAS summary dataset of lung malignancies for tissue-related TWAS analysis (10). TWAS combines pre-computed gene expression weights with the GWAS summary dataset and then calculates the statistical association between gene expression levels and lung cancer. In this study, we used the expression weights of two major subtypes of lung cancer, LUAD and LUSC. The LUAD and LUSC expression weights, derived from The Cancer Genome Atlas (TCGA) multi-tissue RNA-seq data, were downloaded and used as reference data in the TWAS of lung malignancies. The reference data contain 500 LUAD samples and 464 LUSC samples. The gene expression weights and a detailed description of the analysis approach can be found on the FUSION website ( http://gusevlab.org/projects/fusion/ ).
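As a rough illustration of the summary-based imputation idea, the sketch below computes a TWAS z-score for a single gene as the expression-weighted sum of GWAS SNP z-scores, scaled by the linkage disequilibrium (LD) among those SNPs. The function name `twas_z` and all numeric inputs are hypothetical toy values, not data from this study, and FUSION itself performs additional steps (such as weight model selection and LD reference handling) that are not shown here.

```python
import math


def twas_z(weights, gwas_z, ld):
    """Summary-based TWAS z-score for one gene: w'z / sqrt(w' * LD * w).

    weights : expression weights, one per SNP (hypothetical toy input)
    gwas_z  : GWAS z-scores for the same SNPs, in the same order
    ld      : square LD (correlation) matrix for those SNPs
    """
    n = len(weights)
    if len(gwas_z) != n or len(ld) != n or any(len(row) != n for row in ld):
        raise ValueError("weights, gwas_z and ld must have matching dimensions")
    # Weighted sum of SNP-level z-scores (imputed expression-trait association).
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    # Variance of the weighted sum under the null, given LD between SNPs.
    variance = sum(
        weights[i] * ld[i][j] * weights[j] for i in range(n) for j in range(n)
    )
    if variance <= 0:
        raise ValueError("variance of predicted expression must be positive")
    return numerator / math.sqrt(variance)


# Toy example with two SNPs in moderate LD (values are illustrative only).
toy_weights = [0.5, -0.2]
toy_gwas_z = [2.0, 1.0]
toy_ld = [[1.0, 0.3], [0.3, 1.0]]
toy_z = twas_z(toy_weights, toy_gwas_z, toy_ld)
print(f"TWAS z-score: {toy_z:.3f}")
```

With a single SNP of weight 1 and no LD adjustment, the TWAS z-score reduces to the GWAS z-score of that SNP, which is a convenient sanity check.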

Chemical element-gene expression interaction database
CTD (Comparative Toxicogenomics Database, http://ctd.mdibl.org/ ) uses controlled vocabularies, ontologies, and structured notation to code and describe chemical-gene, chemical-disease, and gene-disease relationships, which are represented by inferred chemical-gene-disease networks (22). The Swanson ABC model suggests that if chemical A interacts with gene B and, independently, gene B is directly associated with disease C, then chemical A has an inferred relationship to disease C. Accordingly, we explored the connections between environmental chemical substances and genes and their effects on lung cancer (14). In this study, we downloaded the chemical annotation terms of chemical-gene pairs from CTD and generated chemical-related gene sets. The information retrieval process was described in a previous study (23).
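The gene-set construction and the ABC inference can be sketched as follows. This is a minimal illustration that assumes the chemical-gene annotations have already been reduced to (chemical, gene) pairs; the function names and the placeholder identifiers (`chemical_A`, `GENE1`, and so on) are hypothetical, and real CTD exports contain more columns than are modeled here.

```python
from collections import defaultdict


def build_gene_sets(chemical_gene_pairs):
    """Group (chemical, gene) pairs into one gene set per chemical."""
    gene_sets = defaultdict(set)
    for chemical, gene in chemical_gene_pairs:
        gene_sets[chemical].add(gene)
    return dict(gene_sets)


def infer_chemical_disease(gene_sets, disease_genes):
    """Swanson ABC inference: chemical A is linked to disease C when at least
    one gene B interacts with A and is also associated with C.

    Returns a dict mapping each inferred chemical to its bridging genes.
    """
    inferred = {}
    for chemical, genes in gene_sets.items():
        bridging = genes & disease_genes
        if bridging:
            inferred[chemical] = bridging
    return inferred


# Placeholder data; these are not real CTD records.
toy_pairs = [
    ("chemical_A", "GENE1"),
    ("chemical_A", "GENE2"),
    ("chemical_B", "GENE3"),
]
toy_disease_genes = {"GENE2", "GENE4"}

toy_gene_sets = build_gene_sets(toy_pairs)
toy_inferred = infer_chemical_disease(toy_gene_sets, toy_disease_genes)
print(toy_inferred)
```

In this toy case only `chemical_A` shares a gene with the disease gene list, so it is the only chemical that receives an inferred relationship.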

Chemical-related gene set enrichment analysis
Classical GSEA is a method for functional enrichment. It focuses on gene sets that share a common biological function, location or regulation (24). Here, we extended classic GSEA by integrating the chemical-gene interaction networks with the TWAS expression association statistics of the two major subtypes of lung cancer. Specifically, we explored the relationships between chemicals and lung cancer using a weighted Kolmogorov-Smirnov-like running-sum statistic (25). An empirical distribution of GSEA statistics was obtained for each chemical by performing 5,000 permutations, and the P value of each chemical was calculated from its empirical distribution (26). The normalized enrichment score represents the overrepresentation of lung cancer-associated genes in a chemical-related gene set, and chemical-related gene sets with P < 0.05 were considered statistically significant (24,25). The DNA methylation data were analyzed using the same approach.
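The running-sum statistic and its permutation test can be sketched as below. This is an illustrative implementation of the classic weighted Kolmogorov-Smirnov-like enrichment score, not the code used in this study. Two choices are assumptions made for the example: genes are ranked by their signed association statistic, and the null distribution is built by drawing random gene sets of the same size. All gene names and statistics are toy values.

```python
import random


def enrichment_score(gene_stats, gene_set, power=1.0):
    """Weighted KS-like running sum; returns the largest deviation from zero."""
    # Rank genes from most positive to most negative statistic (assumption).
    ranked = sorted(gene_stats, key=gene_stats.get, reverse=True)
    hits = [g for g in ranked if g in gene_set]
    n_miss = len(ranked) - len(hits)
    if not hits or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    hit_total = sum(abs(gene_stats[g]) ** power for g in hits)
    if hit_total == 0:
        raise ValueError("all gene-set statistics are zero")
    running, best = 0.0, 0.0
    for gene in ranked:
        if gene in gene_set:
            # Step up by the gene's share of the total hit weight.
            running += abs(gene_stats[gene]) ** power / hit_total
        else:
            # Step down uniformly for genes outside the set.
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(gene_stats, gene_set, n_perm=5000, seed=0):
    """Empirical P value from random gene sets of matching size (assumption)."""
    rng = random.Random(seed)
    genes = list(gene_stats)
    size = len(gene_set & set(genes))
    observed = enrichment_score(gene_stats, gene_set)
    exceed = 0
    for _ in range(n_perm):
        null_set = set(rng.sample(genes, size))
        if abs(enrichment_score(gene_stats, null_set)) >= abs(observed):
            exceed += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (exceed + 1) / (n_perm + 1)


# Toy association statistics for six genes and one chemical-related gene set.
toy_stats = {"G1": 3.0, "G2": 2.5, "G3": 0.5, "G4": -0.2, "G5": -1.0, "G6": -2.0}
toy_set = {"G1", "G2"}
toy_es, toy_p = permutation_p_value(toy_stats, toy_set, n_perm=1000, seed=1)
print(f"ES = {toy_es:.3f}, empirical P = {toy_p:.4f}")
```

Normalizing the observed score by the mean of the permuted scores would give a normalized enrichment score; that step is omitted here to keep the sketch short.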

Venn analysis
The related chemicals common to the GWAS summary data and the DNA methylation data were identified using VENNY 2.1 ( http://bioinfogp.cnb.csic.es/tools/venny/index.html ).
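The same overlap can be reproduced programmatically with set operations. The sketch below is only an illustration of what a two-set Venn analysis computes; the function name and the chemical labels are hypothetical placeholders rather than results from this study.

```python
def venn_regions(gwas_chemicals, methylation_chemicals):
    """Return the three regions of a two-set Venn diagram as sorted lists:
    (only in GWAS-based results, shared, only in methylation-based results)."""
    gwas = set(gwas_chemicals)
    methylation = set(methylation_chemicals)
    return (
        sorted(gwas - methylation),
        sorted(gwas & methylation),
        sorted(methylation - gwas),
    )


# Placeholder chemical lists (not the study's results).
toy_gwas_chemicals = ["chemical_A", "chemical_B", "chemical_C"]
toy_methylation_chemicals = ["chemical_B", "chemical_C", "chemical_D"]

only_gwas, shared, only_methylation = venn_regions(
    toy_gwas_chemicals, toy_methylation_chemicals
)
print("Shared chemicals:", shared)
```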

Discussion
As environmental chemicals play a significant role in tumorigenesis, it is necessary to find tumor-related chemicals and provide new clues for neoplastic prevention and treatment. However, few methods can measure chemical exposure in vivo efficiently. Traditional epidemiologic investigations generally consume considerable labor, funds and material resources. Moreover, it is difficult to eliminate interference between the various exposure factors.
In this study, we focused on the genetic component of lung cancer, the profiles of which are available online. Using genes as a medium to explore the relationship between chemical substances and diseases avoids confounding factors and makes the results more robust. Additionally, our approach was applied from both the genomic and the epigenetic perspective, allowing a more comprehensive assessment of the link between chemicals and disease. GSEA was originally a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (24). In this study, we used a chemical-related GSEA approach to integrate TWAS, DNA methylation data and CTD. We identified aluminum, naringenin, and 2-acetylaminofluorene as LUSC-related chemicals, and antirheumatic agents, nickel monoxide and 2-amino-2-methyl-1-propanol as LUAD-related chemicals (Fig. 2).
Among the three LUSC-related chemicals, 2-acetylaminofluorene (2-AAF) is widely used as an established biochemical tool to induce tumors of the liver, bladder, kidney, and skin in laboratory animals (27). The United States Department of Health and Human Services (HHS) has classified 2-acetylaminofluorene as reasonably anticipated to be a human carcinogen based on sufficient evidence of carcinogenicity in experimental animals (27,28). Studies have shown that 2-AAF induces significant oxidative stress and hyperproliferation by enhancing ornithine decarboxylase activity and DNA synthesis (29). Naringenin, a type of flavonoid, is common in grapefruit, other fruits, and herbs (30). Studies have shown that naringenin can lead to cell apoptosis by inducing reactive oxygen species (ROS) generation, mitochondrial depolarization, nuclear condensation, DNA fragmentation, cell cycle arrest in G0/G1 phase, and caspase-3 activation (31). Aluminum (Al) is a common element that exists in large amounts and is widely used across the world. There are many routes of exposure to Al, such as the diet, antacids, vaccines, various household products, and antiperspirants (32,33). The Al-related proteome constitutes up to 8.9% of the entire human proteome and is mainly distributed in the prostate, lymphatic system, brain, lung, and ovaries. The genetic overlap shared between the Al proteome and 12 types of cancer, including lung adenocarcinoma, lung squamous cell carcinoma and breast cancer, was 18.6% (34).
Accumulating evidence has demonstrated that high levels of aluminum are related to breast cancer and other pathological conditions, such as dialysis dementia, osteomalacia and neurodegenerative diseases (35). Taken together with these studies, the potential association between Al and LUSC provides clues for subsequent mechanistic research and epidemiological studies.
Some of the LUAD-related chemicals have been reported in previous studies. Nickel monoxide (NiO) is a common inorganic compound widely used in glass, ceramics, lithium-ion batteries, electrochemical sensors and biosensors (36). The National Toxicology Program of HHS has reported increased incidences of lung cancers among workers in certain nickel-refining facilities (37). NiO nanocomposites have cytotoxic effects: they were found to increase the level of ROS, reduce cell viability, and modify cell cycle arrest (38). Antirheumatic agents, also known as disease-modifying antirheumatic drugs (DMARDs), include methotrexate, hydroxychloroquine, sulfasalazine, penicillamine, azathioprine, and the thiopurines. Some DMARDs, such as methotrexate and rituximab, can also be used as clinical anti-cancer drugs, whereas thiopurines and anti-TNFα preparations were reported to increase the risk of skin cancer and lymphoma (39). Notably, there is now a consensus on the contraindication of tumor necrosis factor inhibitors in patients at risk for lymphoma (40). Because DMARDs are highly varied and act through complex mechanisms, the relationship between DMARDs and LUAD needs further study. 2-Amino-2-methyl-1-propanol (AMP) is a multifunctional additive used as a dispersant, solvent, emulsifier, defoaming agent, and neutralizing agent, and it is present in cosmetic lotions, spray hair gels, household detergents, dyes and pigments, paints, and pesticides, among other products (41). The toxic effects of AMP were studied by Hossy et al. in albino hairless mouse models; they observed that sunscreen plus irradiation and vehicle plus irradiation led to thickening of the epidermis and increased dermal cellularity (42). However, the relationship between AMP and tumors is not clear. Its role in the occurrence and development of lung cancer can be clarified by further study of the mechanism of AMP in cell proliferation.
Our study has limitations. First, the chemical-gene interaction networks were based on the chemical-gene interaction information provided by CTD, so the accuracy of that information may affect the results of the chemical-related GSEA. Second, this study was based on genomic data and used integrative bioinformatics analysis to identify potential environmental chemicals associated with lung cancer; our results therefore do not provide information on the molecular mechanisms of the identified chemicals. We will conduct further experimental research to confirm our results and explore the molecular mechanisms of these chemicals in the pathogenesis of lung cancer.

Conclusion
In this study, we performed a chemical-related GSEA based on GWAS summary data, DNA methylation data and CTD to explore the association between environmental chemical substances and lung cancer. This study linked various chemicals to lung cancer and provided new clues for exploring the etiology of lung cancer. It also demonstrated the capacity of the GSEA approach to detect relationships among environmental chemicals, genes, and diseases, which can open a new way to understand other complex diseases in depth.

Figure: Venn diagram of the related chemicals shared by the GWAS summary data and the DNA methylation data.