Multivariate Analysis of Potential Pleiotropic Genes For Breast, Ovarian And Cervical Cancers Using Gene-Based Association Analysis

Although genome-wide association studies (GWAS) have a dramatic impact on susceptibility locus discovery in gynecological malignancies, the single nucleotide polymorphisms (SNPs) identied by this prevailing univariate approach only explain a small percentage of heredity. The extensive previous studies have repeatedly shown breast, ovarian and cervical cancers have common genetic mechanisms and the overlapping pathophysiological pathways. Novel multivariate analytical methods are necessary to identify shared pleiotropic genes. In this study, a total of 40,859 SNPs mapped in 11,597 gene regions were performed to identify potential common variants by using metaCCA and VEGAS2 analysis. Gene enrichment and protein-protein interaction (PPI) network analysis were used to explore potential biological pathways and connectivity. After metaCCA analysis, 4,203 SNPs (P<1.22×10 −6 ) and 1,886 pleotropic gene (P<4.31×10 −6 ) were identied. By screening the results of gene-based P-values, the existence of 3 conrmed pleiotropic genes and 16 novel genes that achieved statistical signicance in the metaCCA analysis and were also associated with at least one cancer in the VEGAS2 analysis were identied. The enrichment analysis showed the biological pathways of these genes were mainly enriched in 4 signaling pathways and 11 differentially expressed genes were found to encode interacting proteins in PPI network analysis. Altogether, we identied novel genetic variants of breast, ovarian and cervical cancers and provided evidence of biological functions which developed new insights for the diagnosis and treatment of these cancers.


Introduction
Breast, ovarian and cervical cancers are the most common and lethal gynecological malignancies.
According to the global cancer statistics, about 3 million new cases are diagnosed and their incidence ranked rst in the global burden of cancer (16.4% of the total cases) in 2018 [1]. Also, the epidemiological researches have showed that all three cancers are linked to hereditary and exhibits family clustering [2,3].
Approximately, 180 genetic loci for breast cancer, 40 susceptibility loci for cervical cancer and 40 association loci for ovarian cancer have been identi ed in case-control genome-wide association studies (GWAS) and expression quantitative trait loci studies, which indicated that associations between genes and phenotypes are signi cant risk factors in the development of this three cancers.
Common molecular mechanisms and genetic polymorphism existed widely in the gynecology-related cancers. A molecular study revealed the involvement of autophagy in the development of cervical, endometrial and ovarian cancer [4]. Cross-cancer research showed that variation at the genetic level in uenced the formation of multiple gynecology-related cancers. For example, analysis of gene expression has con rmed the association of variants at the 19p13 locus with estrogen receptor negative breast cancer and ovarian cancer [5]. Seven new cross-cancer loci were identi ed in a comprehensive GWAS meta-analysis of breast, ovarian and prostate cancers and further pathway analysis revealed apoptosis as the common susceptibility mechanism of these three cancers [6]. This suggested the existence of complex and low-penetrance polymorphic genes with shared effects among gynecologyrelated cancers.
GWAS is a commonly used approach for verifying individual single-nucleotide polymorphisms (SNPs) sequentially against a quantitative phenotype measure. However, a comparative study of GWAS and meta-analysis showed the univariate method were limited in detecting the heritability of intricate phenotypes because the widely existed internal correlation of genotype-genotype and phenotypephenotype was ignored [7]. Recent research suggested simultaneously performing multiple related traits analysis could increase statistical power and identify complex genotype-phenotype correlations compared to the univariate GWAS analysis [8]. In addition, with the availability of GWAS summary statistical data, the researchers focus on performing multivariate statistical analysis to identify common genetic variant based on public data.
MetaCCA, a novel systematic statistical analysis method using canonical correlation analysis devised by Cichonska et al [8], was adopted to solve the limitation of GWAS. It aims to identify complex correlations between multiple genotypes and multiple phenotypes through combining several univariate summary statistics. It has been applied to identify novel shared genetic factors involved in the development of psychiatric disorders [9]. In this study, metaCCA was applied to detect potential overlapping pleiotropic genes shared by breast, ovarian and cervical carcinogenesis. In addition, gene enrichment and proteinprotein interaction (PPI) network analysis were performed to explore potential biological function and connectivity among them.

GWAS Datasets
Primary SNP-level summary data for the three cancers were obtained from the Breast Cancer Association Consortium (BCAC), the Ovarian Cancer Association Consortium (OCAC) and the GWAS Catalog. The study included 89,677 breast cancer samples (46,785 cases and 42,892 controls), 66,450 ovarian cancer samples (25,509 cases and 40,941 controls), and 9,628 cervical cancer samples (2,866 cases and 6,762 controls) where all individuals were of European ancestry [10][11][12]. The summary statistics, including Pvalues, regression coe cients and standard errors calculated in the meta-analysis are available. Finally, overlapping SNPs of breast, ovarian and cervical cancers were selected for further analysis.

Data Preparation
A total of 294,879 common SNPs were acquired after combining summary statistics for all the three cancers. To select SNPs with small pairwise associations, linkage disequilibrium (LD) based SNP pruning method was used to select SNPs with imputation r 2 ≤ 0.2 in European population. Following this primary selection of SNPs with lower LD, the process was repeated using a sliding window of ve SNPs to con rm complete removal of pairs of SNPs with high LD. The whole process of SNP selection used the HapMap version 3 CEU genotypes panel (Utah residents with Northern and Western European ancestry from the CEPH collection) as a reference. After deleting the SNPs with large LD, 40,967 SNPs was remained. Then, according to the 1,000 Genome datasets, we accomplished the gene annotation for the pruned SNPs using PLINK1.9 which based on the hg19 human genome build (https://www.coggenomics.org/static/bin/plink/glist-hg19). Finally, 40,859 SNPs were mapped 11,597 gene regions were included in the metaCCA analysis. It is worth noting that the regression coe cient beta of each annotated SNP has to be normalized prior to metaCCA analysis [8].

Multivariate MetaCCA Analysis
The purpose of metaCCA analysis was to identify the potential pleiotropic genes for multiple diseases, which required a full covariance matrix (∑), consisting of a cross-covariance matrix between all genotypic and phenotypic variables (∑XY), a genotypic correlation structure between SNPs(∑XX), and a phenotypic correlation structure between traits(∑YY). ∑XY was imputed with the normalized regression coe cient β gp (g and p were the number of genotypic and phenotypic variables, respectively).∑XX was estimated using the reference SNP dataset of the HapMap 3 CEU. Each ∑YY entry corresponded to a Pearson correlation coe cient between vectors of β estimates from p phenotypic variables across g genetic variants [13]. It has been proved that the more the number of genotypic variables there are, the less the error of the estimate there is [14]. Therefore, 294,879 overlapping SNPs were used as the estimation of ∑YY, even if only some of the SNPs were included in further analysis. The canonical correlation coe cient r was used to describe the association between genotypes and phenotypes.
In this study, we performed association analysis to identify canonical correlations in the SNP and gene levels, respectively. In the SNP level, we selected susceptibility loci signi cantly associated with the three cancers using univariate SNP-multivariate phenotype association analysis. In addition, the GWAS summary statistics were examined to con rm random selection of the sample of 40,859 SNPs. The results revealed that the standardized mean of β approached zero and the corresponding median P-value approached 0.5 for all the three datasets, thus con rming that the sample of SNPs was from a random sample of the whole genome. The signi cant level α (two-sided) was corrected with Bonferroni method (adjusted α = 0.05/40,859). Canonical correlation r of any SNP was signi cant when the P-value was smaller than 1.22 × 10 − 6 (= 0.05/40,859). In the gene level, we conducted multivariate SNP-multivariate phenotype association analysis to obtain a canonical correlation coe cient between a gene and the three cancers. Similarly, Canonical correlation r of any gene was signi cant when the P-value was smaller than 4.31 × 10 − 6 (= 0.05/11,597).
As a complement to the metaCCA analysis, we performed Versatile Gene-based Association Study-2 (VEGAS-2) analysis to re ne the identi ed genes (https://vegas2.qimrberghofer.edu.au/). This method was a gene-based approach considering associations between a trait and all SNPs within a gene rather than each SNP individually [15]. All SNPs in each gene were incorporated into the VEGAS-2 analysis and then P-value was calculated for each trait in the gene level. Ultimately, we selected the pleiotropic genes associating with at least one phenotype using the adjusted threshold (= 1× 10 − 6 ).

Functional Annotation
For identifying biological functions of the signi cant pleiotropic genes in our study, we explored the potential biological pathways using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. Gene enrichment analysis was performed using KOBAS3.0 website developed by Center for Bioinformatics, Peking University (http://kobas.cbi.pku.edu.cn/index.php). In the enrichment analysis, P < 0.01 indicated statistically signi cant differences. To evaluate functional connectivity of genes, PPI network analysis were performed through Search Tool for the Retrieval Interacting Genes (STRING11.0) database (https://string-db.org/) [16]. We considered the total score above 0.400 that correspond to the combination of the following 7 different scores: text mining, experiments, databases, co-expression, gene neighborhood, gene fusion and gene co-occurrence.

Data availability
This data is publicly available from the BCAC, OCAC and GWAS Catalog.

Results
A two-step analysis strategy was performed to identify common SNPs and genes associated with breast, ovarian and cervical cancers. First, associations between multiple genotypes and multiple phenotypes were evaluated using metaCCA. The next step was to validate these pleiotropic genes for their speci c associations with single cancer using the VEGAS-2 method. In addition, we performed gene enrichment and PPI network analysis to explore potential biological function and connectivity of these pleiotropic genes.

Pleiotropic SNPs and Genes Identi ed by MetaCCA Analysis
After the gene annotation and SNP pruning, we performed metaCCA analysis with a total of 40,859 SNPs located in 11,597 gene regions. In the univariate SNP-multivariate phenotype analysis, we obtained 4,203 signi cant SNPs (P < 1.22 × 10 − 6 ), and the canonical correlation coe cient r ranged from 0.018 to 0.083. The associations between SNP and phenotype are represented in the Manhattan plot (Fig. 1). If the -log_10 (metaCCA) value of a SNP was more than 5.91(=-log 10 (1.22×10 -6 )), this SNP was regarded as a potential pleiotropic SNP for the 3 cancers. In the multivariate SNP-multivariate phenotype analysis, we identi ed 1,886 signi cant genes (P < 4.31×10 − 6 ), and the canonical correlation r of genes ranged from 0.022 to 0.680.

Re ning the Pleiotropic Genes by VEGAS-2 Analysis
To evaluate the relationship between genes and cancers identi ed by metaCCA, we performed VEGAS-2 analysis to identify genes associated with one of the three traits. Using the VEGAS-2 methods, 98 genes were found to be related only with one trait (18 genes for cervical cancer, 61 genes for breast cancer and 20 genes for ovarian cancer).
By combining the results, 19 putative pleiotropic genes were con rmed to have a statistical signi cance in the metaCCA analysis and were associated with at least one trait in the VEGAS-2 analysis. According to the VEGAS-2 analysis, 3 pleiotropic genes (EHMT2, LST1, LTA) were found to be associated with cervical cancer, 14 pleiotropic genes were found to be associated with breast cancer and 3 pleiotropic genes (BNC2, CRHR1, MLLT10) were found to be associated with ovarian cancer. The pleotropic gene, MLLT10, was associated with ovarian cancer and breast cancer based on VEGAS-2 analysis (P < 1×10 − 6 ). The ndings of the metaCCA and VEGAS-2 analysis were summarized in Table 1. Particularly, among the 19 putative pleiotropic genes identi ed in this study, 3 (CRHR1, LTA and MLLT10) were previously reported to have an association with more than one trait. While the other 16 were novel pleiotropic genes, 10 genes (BNC2, CASC8, CCDC170, ESR1, FGFR2, ITPR1, MAP3K1, NEK10, PEX14, and TNNT3/LSP1) have been previously con rmed to be associated with one of the three cancers but was identi ed to be associated with all three cancers in our study. The remaining 6 genes (CASC21, EHMT2, FAM227A, LOC100506674, LST1, and SYT8) might represent novel pleiotropic candidate genes for breast, ovarian and cervical cancers. The detailed features of the 19 signi cant pleiotropic genes were shown in Table 2. Con rmed: Genes were associated with more than one cancer identi ed by previously studies.
Novel*: Genes were associated with only one cancer identi ed by previously studies.
Novel: New genes identi ed by this study.
Gene Enrichment and PPI Analysis When signi cant genes associated with the three cancers were analyzed with KOBAS3.0 online analysis, the genes were mainly enriched in the gonadotropin-releasing hormone (GnRH) signaling pathway, estrogen signaling pathway, proteoglycans in cancer and mitogen-activated protein kinase (MAPK) signaling pathway. Details of the functional enrichment analysis results were shown in Table 3. PPI network analysis showed that 11 differentially expressed genes were found to encode interacting proteins (Fig. 2).

Discussion
In this study, 19 pleiotropic genes were identi ed to associate with breast, ovarian and cervical cancers, including 3 con rmed pleiotropic genes and 16 novel pleiotropic genes. Gene enrichment and PPI network analysis showed shared biological function and connectivity of the three cancers which developed new insights for the diagnosis and treatment of gynecology-related cancers.
Three con rmed pleiotropic genes (CRHR1, LTA, and MLLT10) have also been identi ed in this study, implying that these genes exerted a crucial in uence on gynecology-related cancers. CRHR1 increased the expression of Fas ligand through corticotropin-releasing hormone and positive expression of Fas ligand was associated with higher ovarian tumor stage [29]. However, the activation of CRHR1 diminished breast cancer progression through participating in Urocortin-repressed transforming growth factor β1 signaling [30]. LTA was a member of the tumor necrosis factor family, which encoded a cytokine produced by lymphocytes [2]. LTA locus variants was likely to increase the risk of gynecology-related cancers including breast and cervical cancers [31]. In addition, LTA gene was signi cantly associated with increased risks of cervical and vulvar cancers according to gene-based analysis [25]. MLLT10 has been reported to associate with breast and ovarian cancers [27]. Interestingly, the genes were also identi ed to associate with breast and ovarian cancers in VEGAS-2 analysis and with all three cancers in metaCCA analysis. Potential biological mechanisms of the pleiotropic gene may include inhibition of cell cycle arrest and apoptosis, impaired Deoxyribonucleic acid repair and myelopoiesis regulation [32].
The other 10 novel pleiotropic genes associated with one of the three cancers in previously studies, some of them were showed in the PPI network co-expressed with each other suggesting important roles of these genes in gynecology-related cancers. BNC2 was highly expressed in endometrium and ovary, which indicated an important role of BNC2 in the development of other gynecology-related cancers. ESR1 played a crucial role in the etiology of breast cancer by regulating estrogen signal transduction. ESR1 expression controlled a feed-forward that sustained activation of the CXCR7/CXCL11 chemokine axis to induce the metastatic behavior of ovarian cancer cells [33]. Besides, loss of estrogen receptor-α had a major role in mediating cervical cancer invasion and progression [34]. FGFR2 was a member of the broblast growth factor receptor family, which encoded the broblast growth factor receptor. FGFR2 aberrations might affect multiple gynecology-related cancers, such as breast, ovarian, cervical and endometrial cancers [35]. ITPR1 was identi ed as an autophagy gene and signi cantly associated with overall survival in breast cancer [36]. Pathway analysis also showed that ITPR1 and ESR1 were jointly involved in the estrogen signaling pathway that promoted breast cancer progression in cellular.
Compared with genes identi ed in previous GWAS studies of these three cancers, the remaining six genes (CASC21, EHMT2, FAM227A, LOC100506674, LST1, and SYT8) have never been reported. Research showed that CASC8 and CASC21 were a hotspot gene integrated by human papilloma virus and promoted the development of cervical cancer [37]. LST1 was con rmed to correlate with relapse-free survival and distal metastasis-free survival in triple-negative breast cancer and general breast cancer [38]. Another important gene, EHMT2 encoded a methyltransferase and higher EHMT2 expression was found to be associated with an adverse prognosis with breast cancer patients and predicted a shorter overall survival in ovarian cancer patients. Therefore, CASC21, EHMT2 and LST1 might have important effect in the development and therapeutics of other gynecology-related cancers. Detailed biological mechanism of 6 genes remain unclear, and further experiments might need to be conducted to identify the new ndings.
Compared with the standard single phenotype GWAS, this cost-effective analytical method provided advantages for the exploration of genetic polymorphisms. First, higher statistical power and larger sample size were obtained by combining the summary statistics of the three large GWAS. Second, present analysis of multiple related traits could contribute to richer ndings and increasing the possibility of discovering associations that are more complicated. Inevitably, there were also some limitations in our study. First, the research was unable to determine the proportions of variability explained by the identi ed genes due to shortfall of detailed primary individual measures. Second, our ndings were based on data analysis without experimental approach. Additional experiments might be necessary to con rm the novel genes of gynecology-related cancers in further studies.

Conclusion
In summary, we performed multivariate statistical analysis of breast, ovarian and cervical cancers using metaCCA and VEGAS-2 based on summary statistics, identi ed 3 con rmed pleiotropic genes in the previous study and 16 signi cant genes that may be the novel pleiotropic candidate genes associated with all three cancers, which developed new insights for the diagnosis and treatment of gynecologyrelated cancers. Declarations