We first combined the association p-values of variants by meta-analysis from two independent GWASs. Association analyses at genes and multiple knowledge-based gene-sets were carried to prioritize potential HBV-related HCC susceptibility genes. A series of prioritized variants were selected to replicate their genetic associations in a group of independent case-control samples. The overall workflow is shown in Figure 1.
Genome-wide meta-analysis of two HBV-related HCC GWASs in Chinese populations
Association p-values were imputed based on the linkage disequilibrium (LD) pattern in the Eastern Asian Panel from the 1000 Genomes Project. A genome-wide meta-analysis was then performed with SNP p-values from two existing Chinese HCC GWASs using the tool FAPI . After quality control (QC), 5,375,073 meta-analysis p-values of SNPs were obtained. The Manhattan plot and QQ plots of p-values are shown in Supplementary Figure 1 and Supplementary Figure 2, respectively. At the upper tail of the QQ plot, there is a deviation from the 95% confidence level of the non-hypothesis line, suggesting the existence of association signals at some SNPs. The small proportion of significant signals was consistent with the estimated low heritability in the samples by GCTA, 0.063 (±0.028) on the underlying liability scale .
Gene-based association analysis
We then used the meta-analysis p-values for gene-based association analysis by GATES  and ECS  on a tool called KGG (version 3.5) . In addition to SNPs within the untranslated regions, introns and exons, the meta-analysis p-values of SNPs within 5kb upstream and downstream of a gene were also included in the gene-based association test by GATES and ECS. SNPs in overlapping regions of multiple genes were assigned to all involved genes. The QQ plots of gene-based p-values are shown in Figure 2.
According to the gene-based p-values by GATES, two genes, SLC39A8 and GOLGA8M passed the multiple-testing correction by FDR, 0.05 (Table 1). In addition, two genes, SMIM31 and WHAMMP2, have nearly significant q-values (<0.06 by GATES) on the genome (Table 1). Interestingly, SMIM31, encoding small integral membrane protein 31, was annotated as a long noncoding RNA gene (LINC01207) previously. We further annotated the pseudogene, WHAMMP2, with known regulatory elements and epigenomic markers by the UCSC genome browser (http://genome.ucsc.edu). Although it is annotated as a pseudogene, there are multiple regulatory factors binding sites and epigenomic markers in WHAMMP2 (See Supplementary Figure 3). These annotations imply that this gene is also functionally active despite not encoding proteins. The other gene-based test, ECS, detected no significant gene. The gene with smallest p-value (7.5E-06) is RNF157-AS1.
Prioritization of genes in different gene-sets
To select more promising candidate genes for replication in independent samples, we resorted to a series of gene-set resources to prioritize genes with suggestive association p-values. We first examined the association with HCC in 1,057 canonical pathways curated in the Molecular Signatures Database (MSigDB V 4.0), after removing the pathways containing too few (<5) or too many (>300) genes. The gene-set-based association p-value was performed by LDRT  on KGG. Although no gene-sets passed multiple testing (FDR q<0.05), several promising functional gene sets are prioritized. The top two gene sets according to the p-value are the cell cycle G1/S transition (p=5.5E-4) and the NOTCH1 intracellular domain regulates transcription (p=7.1E-4). In the G1/S transition gene set, 12 out 99 genes had gene-based association (p<0.05, See details in Supplementary Excel Table 1). The gene with the smallest p-value is CDC45 (p=1.1E-4) in this gene set. In the gene set of NOTCH1 intracellular domain regulates transcription, 10 out 40 genes had gene-based association (p<0.05, See details in Supplementary Excel Table 1). In the set, NCOR1 had the smallest p-value (p=5.8E-3). The second gene, KAT2A, had similar p-value (6.6E-3).
Then, we investigated whether the genes highly and specifically expressed in human liver were associated with HCC. In the database, Tissue-specific Gene Expression and Regulation (TiGER, http://bioinfo.wilmer.jhu.edu/tiger/), 309 genes preferentially expressed in liver were retrieved. In the human proteome atlas (http://www.proteinatlas.org/humanproteome), 433 genes showing elevated expression of proteins in liver compared to other tissue types were retrieved as well. To reduce potential false positives, we only used overlapping genes in the two sets. As a result, a total of 189 genes were obtained. Three genes (PAH, UGT2B10 and UROC1) had the FDR q values <0.1 by ECS while GATES did not detect any significant gene (See the genes and p-values in Table 2 and Supplementary Table 1).
We also examined the association of recurrent integrated genes by HBV reported in previous studies [27-30], the genes reported to be genetically associated with HBV-related HCC risk in previous studies, and HCC risk genes defined by COSMIC database (http://cancer.sanger.ac.uk/cosmic). However, none of the genes had a promising association p-value with HCC in our samples (see the genes and p-values in Supplementary Table 2-4).
Replication study in independent samples
We replicated genetic association at genes prioritized by the above gene-based and gene-set-based associations in a group of independent HBV-related HCC case-control samples. Due to budget limit, only 21 SNPs were selected for the replication. The SNPs were at prioritized genes according to consistency of their allele frequencies in ancestry matched reference panel in the 1000 Genomes Project and HapMap Project, and/or their predicted functional importance by RegulomeDB (http://regulomedb.org/) with regulatory elements (See examples in Supplementary Figures 3 and 4). After the genotype quality assessment, two SNPs were excluded because they failed to pass the Hardy-Weinberg equilibrium test (p<0.001).
Three genetic models (additive, dominant and recessive) were considered under a logistic regression framework in which the HCC status was adjusted for sex and age. None of the 19 SNPs survived the multiple Bonferroni correction for family-wise error rate 0.05. Only two SNPs, rs17343667 and rs389883, had a nominal p-value below 0.05. The rs17343667, which is located in the first intron of EIF2AK1, had an association p-value equal to 0.02 under the dominant model with an odds ratio of 1.27 for the minor allele, which was found to have a risk effect in both original Qidong and Hong Kong GWAS samples (Table 3). However, its p-value was only 0.15 under the additive model. The rs389883, which is in intron region of STK19, had p-values of 0.026 and 0.032 for HCC association under additive and recessive models, respectively, with a protective effect at the minor allele G. However, in the original Qidong GWAS sample and Hong Kong GWAS sample, G was estimated to have a risk effect. Therefore, the SNP-level replication was generally negative.