An eﬀective gene-based rare variant association analysis pipeline for case–control studies of disease

Background: In complex disease studies, genome-wide association study (GWAS) has been successfully used for identifying associated genetic risk loci. In fact, only a small fraction of the apparent heritability can be explained by common variants. Because research eﬀorts have largely focused on common genetic variants, the missing heritability could be mostly due to rare genetic variants. Substantial research eﬀorts have been devoted to developing software for genotype imputation and designing variant binning strategies and statistical methods for rare variant association testing of datasets on GWAS chips. However, few systematic pipelines have been proposed to identify rare disease-related genes. Results: We present EGRVA, an Eﬀective Gene-based Rare Variant Association analysis pipeline for genotype imputation, quality control, gene-based functional annotation, statistical analysis, and bioinformatics analysis of identiﬁed genes. As a complementary pipeline for rare variant analysis on GWAS chips, EGRVA is relatively straightforward and cost-eﬃcient. Furthermore, we tested the EGRVA pipeline with the preterm birth (PTB) dataset from the GPN-PBR. We focused on the 6 genes identiﬁed by EGRVA: FLG, HRNR, PMS1, ATM, OR2AG1 and SLC22A25. We also explored the underlying biological interpretation of these potentially signiﬁcant genes. Conclusions: As a complementary pipeline for rare variant analysis on GWAS chips, EGRVA is relatively straightforward and costeﬃcient. The application of the pipeline will contribute to the support of rare variants to explain the missing heritability by eﬀectively discovering genes related to disease.


Background
With the rapid development of DNA sequencing technology, an increasing number of susceptibility loci for some complex diseases have been revealed by GWAS ( [1,2,3,4]). These studies are mainly based on common genetic variants, typically with Minor Allele Frequency (MAF) > 5%. Although susceptibility loci for many diseases has been identified, much of the estimated heritability for complex traits has not been explained. Even though a large GWAS meta-analysis has been performed for some diseases, much of the heritability remains unexplained ( [5,6]). For instance, in the genome-wide meta-analysis of Crohn's disease, although the number of confirmed susceptibility loci increased to 71, they explained less than a quarter of the heritability. ( [7]).
For the problem of missing heritability, several explanations have been put forward ( [8,9]). Among them, rare variants attract attention because they are theoretically and empirically responsible for a fraction of the missing heritability. Theoretically, deleterious alleles are likely to be rare due to purification selection ( [10,11]). In fact, loss-of-function variants that block the generation of functional proteins are particularly rare ( [12,13]). There is also empirical evidence that rare variants are associated with complex diseases ( [14,15]). For example, many rare forms of common diseases and Mendelian diseases are caused by highly permeable rare variants ( [16]). Therefore, the identification of rare variants associated with traits and diseases is becoming increasingly important.
With the gradual increase in research into rare variants, various studies have been proposed, but there are only a few works on pipeline development for Rare Variant Association Study (RVAS). Zuk et al. described a conceptual framework to address some key questions, including sample size required to detect association, relative merits of testing disruptive alleles and missense alleles, and frequency thresholds for filtering alleles ( [17]). Seunggeung Lee et al. presented an analysis pipeline for RVAS and reviewed some existing cost-effective sequencing designs and genotyping platforms as well as statistical issues in RVAS with a focus on study designs and statistical tests ( [18]). However, these pipelines provide limited functionality and are not a complete analysis pipeline for GWAS dataset. In addition, these pipelines lack methods to explore the potential biological interpretations of discovered "risky" variants.
To tackle these problems, we present an Effective Gene-based Rare Variant Association analysis (EGRVA) pipeline for case-control studies to identify rare diseaserelated genes. The EGRVA pipeline is for rare variants obtained from GWAS panels and combines novel methods with pre-existing tools. In this study, we present the EGRVA and show how to perform data pre-processing before imputation and how to use external functional annotation information to perform statistical tests to improve the statistical power. In addition, we validate the molecular genetics of some identified variants by using several well-known bioinformatics tools.

Methods
EGRVA accepts data in PLINK format as its input and involves the following steps: genotype imputation, Quality Control (QC), gene-based functional annotation, statistical analysis implemented with a Bayesian method for MIxture model based Rare variant Analysis on GEnes (MIRAGE), and further biological interpretations of the variants in the identified significant genes. The complete workflow is described in Figure 1.

Genotype imputation
Genotype imputation is an important step before rare variant genetic association studies because imputation accuracy increases the number of haplotypes, especially for rare variants ( [19,20,21]). For imputation in our pipeline, we use the Michigan Imputation service with Minimac4 ( [22]). To improve the accuracy of the imputation, we choose a large reference panel, the Haplotype Reference Consortium (HRC) to provide accurate genotype imputation at MAF as low as 0.1% ( [23]). To further guarantee the imputation effect and save imputation time, we included three data pre-processing steps before imputation. In the first step, a preliminary QC procedure is executed by PLINK software ( [24]). The second step involves using LiftOver (http://genome.ucsc.edu/) to convert the genome coordinates into hg19 to satisfy the requirements of the Michigan imputation server. We chose SHAPEIT2 as the phasing tool because it combines the best features of SHAPEIT1 and IMPUTE2 ( [25,26]), which might improve efficiency and accuracy when estimating haplotypes from genotype data. The final step is to check the PLINK bim file against the HRC reference Single Nucleotide Polymorphism (SNP) list with the script in McCarthy Group Tools (https://www.well.ox.ac.uk/˜wrayner/tools/). In this step, we update strand, position and ref/alt assignment, and remove SNPs with differing alleles, allele frequency differences greater than 0.2, and not in the reference panel. In addition, we set the R2 (squared Pearson correlation between imputed and experimental genotypes) to 0.3 to minimize the imputed file size when we start our imputation.

Quality Control
After imputation, BCFTOOLS is used to filter SNPs with R2 < 0.8 and change SNP ID to CHROM: POS: REF: ALT to avoid duplication. Next, we use PLINK software again for QC ( [27]). Because this pipeline focuses on rare variants, SNPs with MAF > 0.5% and < 1% are retained. In addition, SNPs with call rate > 10% and samples with call rate > 5% and Hardy Weinberg Equilibrium P-value < 1.00e-5 are eliminated. We also perform a multi-allele check to remove multiallelic SNPs.

Gene-based functional annotation
Functional variants, such as non-synonymous SNPs, are more likely to be rare and more prone to disrupt gene function ( [28]). They are hypothesized to have greater expected impact on phenotypic development than other variants. Non-synonymous variants are exonic, lying in the coding regions of genes. They are able to disrupt the coding sequence of genes, thereby resulting in malformed and dysfunctional protein products. In this pipeline, we apply ANNOVAR to query the refGen database of functional effect prediction ( [29]). Then, we select exonic non-synonymous variants and use three tools, including polyPhen2 HDIV scores ( [30]), SIFT scores ( [31]) and CADD scores ( [32]), to classify the non-synonymous variants as damaging or non-damaging variants.

Statistical analysis
The single variant test method assesses rare variants with inadequate power ( [16]). Instead of testing the effects based on a single variant, it is more powerful to run the association test based on multiple variants in a biologically relevant region. Many statistical methods have been proposed, such as CMC ( [33]), the Variable Threshold approach ( [34]) and C-alpha ( [35]). Despite favorable performance, methods based on multiple variants feature unrealistic assumptions that all rare variants in a gene have a non-zero effect or zero effect together. To deal with this problem, Shengtong Han et al. develop a Bayesian method for MIRAGE ( [36]). The key idea of this method is to model variants in a gene as a mixture of risk variants and non-risk variants. Each variant has a prior probability of being a risk variant, which depends on the functional annotations of the variant. This prior probability reflects the sparsity of risk variants and better accounts for heterogeneity of variant effects within a gene.
We use MIRAGE for the association analysis and performed it with the MIRAGE R-package. First, we separately count the number of rare variant alleles at each locus in the case group and the control group based on the results of ANNOVAR. Then, we group each variant into different proportions of risk variants based on functional annotation. Deleterious groups have a higher risk proportion. Finally, we calculate each gene's Bayes Factor (BF) to assess its association with the disease by comparing the likelihood of the full model (risk gene) to the likelihood of the null model (non-risk gene) ( [37]). BF is similar to the likelihood ratio test. For example, a BF > 1 provides evidence for rejecting the null model and for the presence of a filler effect.

EGRVA identifies putative risk genes in the PTB dataset
We used EGRVA with the PTB dataset to demonstrate that this pipeline can successfully identify risky genes. The dataset used for the analyse was obtained from the database of Genotype and Phenotype (dbGaP) found at (http://www.ncbi.nlm.nih.gov/gap). In total, 743 singleton pregnancies with spontaneous preterm birth (from 20 to < 34 weeks' gestation) were defined as the case group and 752 pregnancies (from 39 to < 42 weeks'gestation) with spontaneous onset of labor were defined as the control group. In total, data for 76 pregnancies without genotypes were removed. We focused on 702 samples in the case group and 717 in the control group, with 868,278 SNPs retained.
QC was performed before imputation, and two samples with a call rate > 5% were deleted. Overall, limiting the R2 > 0.8, 1,417 samples and 67,975 variants were retained after imputation. QC was implemented again. 1,417 samples and 64,939 variants remained after QC. Before the association study, we used ANNO-VAR to identify 196 exonic non-synonymous variants. Then we set the variants with PolyPhen2 HDIV score > 0.975, CADD score > 20, or SIFT score < 0.05 as group 2 (high-risk group). Others were the low-risk group. In total, 1,417 samples with 196 exonic non-synonymous variants were analyzed using the MIRAGE method. Genes with BF > 10 and posterior probability (post.prob) > 0.8 simultaneously had top potential. Table 1 shows the results of the 6 genes identified by BF > 10 and post.prob > 0.8.

Bioinformatics analyses of the identified genes
PTB is a polygenic disease. Rare variants in identified genes were investigated for their involvement in innate immunity and inflammation. Below is a detailed analysis.
The expression of Filaggrin (FLG), with three SNPs (rs7537147, rs12073613, rs113652604), is highest in skin and is absent from tissues related to mother-andchild interactions such as uterus, placenta, or mammary glands ( [38]). Hornerin (HRNR), with two SNPs (rs6659183, rs138421943), is abundant in barrier organs, such as uterine cervix and placenta. It can regulate protein phosphorylation and inflammatory and immune reactions ( [39]). Furthermore, abundant HRNR can protect organs such as central nervous system and female gonads against the potentially damaging effects of an inflammatory immune response. Post-meiotic segregation increased 1 (PMS1), with three SNPs (rs1145231, rs1145232, rs1145234), is involved in the repair of errors that occur during DNA replication. The DNA mismatch repair (MMR) system has a significant relationship with human fertility ( [40,41]). ATM, with SNPs rs1800056 and rs3218673, belongs to the PI3/PI4-kinase family. It can activate p53. Uterine-specific p53 deficiency confers premature uterine senescence and promotes preterm birth in mice ( [42]). For OR2AG1, with SNPs rs74057919 and rs74057920, it is found that some tocolytics acting via GPCR signal-mediated pathways are involved in myometrium relaxation and contraction ( [43]). Previous studies showed that the organic anion transporter (OAT) subfamily constitutes approximately half of the solute carrier 22 (SLC22) transporter family, including SLC22A25, with SNPs rs17157907 and rs35722529 ( [44]). OAT members are expressed in many tissues such as kidney, liver, and placenta ([45]).

Conclusions
We described a straightforward rare variant analysis pipeline called EGRVA for analyzing GWAS data and validated it with a dataset for PTB. The EGRVA pipeline is easy to implement and is sensitive to risky genes. We verified the effectiveness of the proposed pipeline by bioinformatics analyses of the associated genes with the PTB dataset. The EGRVA pipeline can be an effective supplement to nextgeneration sequencing of rare variants. However, several limitations still exist in the EGRVA pipeline. Indeed, environmental effects related to disease are not considered, and the sample size of the current GWAS dataset is still not large enough.

Declarations
Ethics approval and consent to participate Not applicable.

Consent for publication
The data in this paper are from the public database platform and have been approved for publication.

Availability of data and materials
The PTB dataset that support the findings of this study was obtained from the database of Genotype and Phenotype (dbGaP) found at (http://www.ncbi.nlm.nih.gov/gap) (accession number phs000714.v1.p1). Samples were provided by the NICHD-funded Genomic and Proteomic Network for Preterm Birth Research (GPN-PBR). EGRVA is freely available on https://github.com/ruijiali/EGRVA.

Competing interests
The authors declare that they have no competing interests.