We first combined the association p-values of variants by meta-analysis from two independent GWASs. Association analyses at genes and multiple knowledge-based gene-sets were carried to prioritize potential HBV-related HCC susceptibility genes. A series of prioritized variants were selected to replicate their genetic associations in a group of independent case-control samples. The overall workflow is shown in Figure 1.
Genome-wide meta-analysis of two HBV-related HCC GWASs in Chinese populations
Association p-values were imputed based on the linkage disequilibrium (LD) pattern in the Eastern Asian Panel from the 1000 Genomes Project. A genome-wide meta-analysis was then performed with SNP p-values from two existing Chinese HCC GWASs using the tool FAPI.[12] After quality control (QC), 5,375,073 meta-analysis p-values of SNPs were obtained. The Manhattan plot and QQ plots of p-values are shown in Supplementary Figure 1 and Supplementary Figure 2, respectively. At the upper tail of the QQ plot, there is a deviation from the 95% confidence level of the non-hypothesis line, suggesting the existence of association signals at some SNPs. The small proportion of significant signals was consistent with the estimated low heritability in the samples by GCTA, 0.063 (±0.028) on the underlying liability scale.[25]
Gene-based association analysis
We then used the meta-analysis p-values for gene-based association analysis by GATES[13] on a tool called KGG (version 3.5)[26]. In addition to SNPs within the untranslated regions, introns and exons, the meta-analysis p-values of SNPs within 5kb upstream and downstream of a gene were also included in the gene-based association test by GATES and ECS. SNPs in overlapping regions of multiple genes were assigned to all involved genes. The QQ plots of gene-based p-values are shown in Figure 2.
According to the gene-based p-values by GATES, two genes, SLC39A8 and GOLGA8M passed the multiple-testing correction by FDR, 0.05 (Table 1). SLC39A8 encodes a member of the SLC39 family of solute-carrier genes, which may play an important role in autophagy during ethanol exposure in human hepatoma cells.[27] GOLGA8M encodes golgin A8 family member M. However, it has not been linked to cancer although a study suggested that palindromic GOLGA8 core duplicons promoted chromosome microdeletion and evolutionary instability.[28] In addition, two genes, SMIM31 and WHAMMP2, have nearly significant q-values (<0.06 by GATES) on the genome (Table 1). Interestingly, SMIM31, encoding small integral membrane protein 31, was annotated as a long noncoding RNA gene (LINC01207) previously. Its expression has been associated with lung adenocarcinoma[29], pancreatic cancer[30] and colorectal adenocarcinoma[31]. We further annotated the pseudogene, WHAMMP2, with known regulatory elements and epigenomic markers by the UCSC genome browser (http://genome.ucsc.edu). Although it is annotated as a pseudogene, there are multiple regulatory factors binding sites and epigenomic markers in WHAMMP2 (See Supplementary Figure 3). These annotations imply that this gene is also functionally active despite not encoding proteins. The other gene-based test, ECS, detected no significant gene. The gene with smallest p-value (7.5E-06) is RNF157-AS1, an antisense RNA gene. Differential expression between tumor and non-tumor tissue at this gene has been founded in lung cancer[32] and ovarian cancer[33].
Prioritization of genes in different gene-sets
To select more promising candidate genes for replication in independent samples, we resorted to a series of gene-set resources to prioritize genes with suggestive association p-values. We first examined the association with HCC in 1,057 canonical pathways curated in the Molecular Signatures Database (MSigDB V 4.0), after removing the pathways containing too few (<5) or too many (>300) genes. The gene-set-based association p-value was performed by LDRT[15] on KGG. Although no gene-sets passed multiple testing (FDR q<0.05), several promising functional gene sets are prioritized. The top two gene sets according to the p-value are the cell cycle G1/S transition (p=5.5E-4) and the NOTCH1 intracellular domain regulates transcription (p=7.1E-4). In the G1/S transition gene set, 12 out 99 genes had gene-based association (p<0.05, See details in Supplementary Excel Table 1). The gene with smallest p-value is CDC45 (p=1.1E-4) in this gene set. CDC45 encodes cell division control protein 45 and has been linked to many cancers according to its expression, including HCC[34] and colorectal cancer[35]. In the gene set of NOTCH1 intracellular domain regulates transcription, 10 out 40 genes had gene-based association (p<0.05, See details in Supplementary Excel Table 1). In the set, NCOR1 had the smallest p-value (p=5.8E-3). This gene encodes a protein that mediates ligand-independent transcription repression of thyroid-hormone and retinoic-acid receptors, which may regulate de novo fatty acids synthesis in liver regeneration and hepatocarcinogenesis in mice.[36] The second gene, KAT2A, had similar p-value (6.6E-3). This gene encodes lysine acetyltransferase 2A and was linked to HCC. For instance, Majaz et al. suggested that KAT2A may promote human HCC progression by enhancing AIB1 expression.[37]
Then, we investigated whether the genes highly and specifically expressed in human liver were associated with HCC. In the database, Tissue-specific Gene Expression and Regulation (TiGER, http://bioinfo.wilmer.jhu.edu/tiger/), 309 genes preferentially expressed in liver were retrieved. In the human proteome atlas (http://www.proteinatlas.org/humanproteome), 433 genes showing elevated expression of proteins in liver compared to other tissue types were retrieved as well. To reduce potential false positives, we only used overlapping genes in the two sets. As a result, a total of 189 genes were obtained. Three genes (PAH, UGT2B10 and UROC1) had the FDR q values <0.1 by ECS while GATES did not detect any significant gene (See the genes and p-values in Table 2 and Supplementary Table 1). The gene PAH encoding phenylalanine hydroxylase enzyme had the lowest gene-based p-value, 3.5E-4 by ECS. Many studies have implicated this gene in development of HCC. For example, Miller et al. showed p-Chlorphenylalanine effect on phenylalanine hydroxylase in hepatoma cells in culture.[38] UGT2B10 (p=7.9E-4) encodes UDP-Glucuronosyltransferase 2B10. Hanioka et al. showed that expression of UGT2B isoforms (including UGT2B10) was significantly increased by AFB1 in HepG2 cells.[39] UROC1 (p=1.4E-3) encodes enzyme involved in histidine catabolism, metabolizing urocanic acid to formiminoglutamic acid. Zhang et al. showed that UROC1 may play important roles in HCC development, especially alcohol-related HCC development and progression.[40]
We also examined the association of recurrent integrated genes by HBV reported in previous studies,[41-44] the genes reported to be genetically associated with HBV-related HCC risk in previous studies, and HCC risk genes defined by COSMIC database (http://cancer.sanger.ac.uk/cosmic). However, none of the genes had a promising association p-value with HCC in our samples (see the genes and p-values in Supplementary Table 2-4).
Replication study in independent samples
We replicated genetic association at genes prioritized by the above gene-based and gene-set-based associations in a group of independent HBV-related HCC case-control samples. In total, 21 SNPs of the prioritized genes were selected according to consistency of their allele frequencies in ancestry matched reference panel in the 1000 Genomes Project and HapMap Project, and/or their predicted functional importance by RegulomeDB (http://regulomedb.org/) with regulatory elements. After the genotype quality assessment, two SNPs were excluded because they failed to pass the Hardy-Weinberg equilibrium test (p<0.001).
Three genetic models (additive, dominant and recessive) were considered under a logistic regression framework in which the HCC status was adjusted for sex and age. Generally, the independent sample failed to replicate a significant association in the discovery sample after multiple-testing correction. Only two SNPs, rs389883 and rs17343667, had an association p-value below 0.05. The rs389883, which is in intron region of STK19, had p-values of 0.026 and 0.032 for HCC association under additive and recessive models, respectively, with a protective effect at the minor allele G. However, in the original Qidong GWAS sample and Hong Kong GWAS sample, G was estimated to have a risk effect. The other SNP, rs17343667, which is located in the first intron of EIF2AK1, had an association p-value equal to 0.02 under the additive model with an odds ratio of 1.27 for the minor allele, which was found to have a risk effect in both original Qidong and Hong Kong GWAS samples (Table 3). In addition, the regulator potential of rs17343667 was supported by expression quantitative trait locus (eQTL) and TF binding/ DNase peak (scored 1f) in RegulomeDB (See details in Supplementary Figure 4).