Optimal Genomic Control in Large-scale Genetic Associations for Binary Diseases

Complex computation and approximate solution hinder the application of generalized linear mixed models (GLMM) into genome-wide association studies. We extended GRAMMAR to handle binary diseases by considering genomic breeding values (GBVs) estimated in advance as a known predictor in genomic logit regression, and then controlled polygenic effects by regulating downward genomic heritability. Using simulations and case analyses, we showed in optimizing GRAMMAR, polygenic effects and genomic controls could be evaluated using the fewer sampling markers, which extremely simplied GLMM-based association analysis in large-scale data. In addition, joint analysis for quantitative trait nucleotide (QTN) candidates chosen by multiple testing offered signicant improved statistical power to detect QTNs over existing methods.


Introduction
Although usually expressed as binary phenotypes, many disease traits in plants and animals are thought to be controlled by a number of loci each having a small effect 1,2 . Thus, random polygenic effects excluding the tested markers should be considered in genome-wide association study (GWAS) for disease traits to correct the population strati cation and cryptic relatedness, as linear mixed model (LMM) does for quantitative traits 3,4 . Logistic regression, as a kind of generalized linear model (GLM) 5,6 , has been used earlier for genome-wide association analysis with the disease traits 7 . Despite correction for xedeffect covariates [8][9][10] , logistic regression still produces in ation of association test statistics. Therefore, it is necessary to introduce a generalized linear mixed model (GLMM) 11 that considers random polygenic effects to increase the power to map QTNs for disease traits.
However, genome-wide GLMM association consumes much more computing time than mixed model association with either restricted maximum likelihood estimation (REML) 12 or Markov Chain Monte Carlo (MCMC) iteration 13 . If using the maximum likelihood in estimation and approximations to avoid numerical integration 14 , the GLMM yields a problem of serious bias induced by the approximations 15 , especially solutions tending toward the positive/negative in nity. In the GWAS for case-control studies, a non-random sampling of cases from the population results in biased estimation for genomic heritability.
Analyzed by the LMM for binary phenotypes, genomic heritability estimate is transformed to a liability scale by adjusting both for scale and for ascertainment of the case samples 16 . The genomic heritability of liability is estimated in a biased manner for disease traits, even though it is done by GLMM via MCMC iteration. Based on the calibrated genomic hereditability for case-control ascertainment, a Chi-squared score statistic for GWAS of disease traits is computed from posterior mean liabilities (PMLs) under the liability-threshold model 17 . The individual PMLs are estimated with a multivariate Gibbs sampler, which increases the computational demand. For simpli cation to GLMM-based association analysis, the GMMAT 18 and SAIGE 19 separately extend the EMMAX 20 and BOLT-LMM 21 , respectively, for normally distributed traits to binary phenotypes.
Owing to the computational intensity and approximate solutions obtained, GLMM can hardly be employed in GWAS for disease traits. Moreover, genomic heritability cannot be accurately estimated for complex diseases, especially in ascertained case-control studies. Motivated by the optimal genomic control for mixed model association analysis for quantitative traits distributed normally, we extended GRAMMAR 22 to handle binary traits by considering genomic breeding values (GBVs) estimated in advance as a known predictor in genomic logit regression, and then, optimized the genomic control for GRAMMAR for binary traits by regulating the downward genomic heritability to estimate the residual phenotypes. The complicated GLMM does not need to be directly solved by the Optim-GRAMMAR for binary traits, and it only repeatedly estimates GBVs with genomic best linear unbiased prediction (GBLUP) 23 for GLMM, achieving genome-wide GLMM association analysis rapidly. Finally, we jointly analyzed the candidate quantitative trait nucleotides (QTNs) chosen by multiple testing to improve the statistical power to detect QTNs.

Statistical properties of Optim-GRAMMAR for binary traits
Based on the two genomic datasets, we simulated phenotypes controlled by 40, 200, and 1,000 QTNs at the low (0.2), moderate (0.5), and high (0.8) genomic heritability, respectively. The statistical properties of Optim-GRAMMAR using a test at once for binary traits were investigated by comparing it with GRAMMAR, GMMAT, LTMLM, and SAIGE. The Q-Q and ROC pro les are displayed selectively in Fig. 1 and Fig. 2, respectively, and in Supplementary Fig. 1S and Fig. 2S, respectively, in detail. The genomic controls are estimated in Table 1S. Making genomic control in nitely close to 1.0, Optim-GRAMMAR achieved almost the same statistical power to detect QTNs as the GMMAT which approximates the exact GLMM, irrespective of how many QTNs and heritabilities are simulated. Among Optim-GRAMMAR and the four competing methods, GRAMMAR had the lowest genomic controls and statistical power, and for GRAMMAR, the population structure was more complex and the false negative rate was larger. Although LTMLM achieved the highest statistical power to detect QTNs for all simulated phenotypes for the maize dataset, and SAIGE demonstrated a higher statistical power for the dataset controlled by 1,000 QTNs at the genomic heritability of 0.2. A strong false positive error rate was observed for SAIGE. In the human dataset, there were no distinct differences in the statistical properties between GRAMMAR-lambda and the four competing methods, although GRAMMAR provided some false negative errors.
After optimization for genomic control, Optim-GRAMMAR jointly analyzed multiple QTN candidates chosen from a test at once at a signi cance level of 0.05. For convenience for comparison, we analyzed the statistical powers obtained with one test at a time and joint analyses together. By backward regression analysis, Optim-GRAMMAR evidently exhibited improved statistical power. In contrast, LTMLM was inferior to joint analysis of Optim-GRAMMAR in the terms of statistical power, even with the highest false positive rates.

Page 4/15
To investigate the effects of sampling markers on Optim-GRAMMAR, we randomly took 3 K, 5 K, 10 K, 20 K, and 25 K SNPs from the entire genomic markers to calculate GRM. Changes in the genomic control at the varied sampling levels of SNPs are depicted in Fig. 3 for Optim-GRAMMAR, GRAMMAR, GMMAT, and SAIGE. Because LTMLM cannot sample SNPs, it was not included in the comparison. No competing method stably controlled the positive/negative false errors using less than 25 K sampling SNPs, besides SAIGE for human phenotypes. Speci cally, GMMAT gradually controlled the positive false errors as the sampling markers increased; GRAMMAR controlled the negative false rate by sampling less markers, while SAIGE produced serious false negative errors in the complex maize population. In comparison, Optim-GRAMMAR still retained a high statistical power to detect QTNs through almost perfect genomic control, even using less than 3000 sampling markers (see Supplementary Fig. 3S and Fig. 4).

Application of Optim-GRAMMAR to WTCCC study
We were authorized to re-analyze the Wellcome trust case-control consortium (WTCCC) study 1 24 . There were the 11,985 cases from six common diseases and 3,004 shared controls, genotyped at a total of 490,032 SNPs. For each dataset, a standard quality control (QC) procedure was performed: SNPs with MAFs < 0.01 and HWE > 0.05 were excluded, and individuals with missing rates > 0.01 were also excluded. For the six common diseases, we implemented Optim-GRAMMAR using entire genomic markers and 5,000 sampling SNPs, respectively, to estimate the GRM. The Q-Q and Manhattan pro les for the six common diseases are depicted in Fig. 4S and Fig. 5S obtained with the Optim-GRAMMAR using a test at once and the four competing methods used in simulations, while in Fig. 5S with the Optim-GRAMMAR using joint association analyses. The association analyses illustrated that (1) under perfect genomic control, Optim-GRAMMAR found the QTNs for each disease on each chromosome, and the numbers of detected QTNs were not less than all the competing methods; and (2) in Optim-GRAMMAR, joint association analyses detected more QTNs than a test at once. As compared to Optim-GRAMMAR, GRAMMAR detected less QTNs with the lowest genomic control among all the methods, while GMMAT yielded more SNPs whose -log(p) exceeded the Bonferroni corrected thresholds for CAD, T1D, T2D, and HT, but it obtained the highest genomic control. Additionally, LTMLM estimated the abnormal genomic heritabilities for CAD, BD, T2D, and HT, producing unstable genomic controls.
Further, we conducted strict QC for each dataset, as done in 16 for estimating genomic heritability. Despite this, the missing heritabilities could not be normally estimated for BD and HT. As shown in Fig. 6S and Fig. 7S, all methods exhibited clear and comparable association results, except for GRAMMAR.
Interestingly, both LTMLM and GMMAT seriously underestimated the genomic heritability for each disease after strict QC. In summary, Optim-GRAMMAR could e ciently and robustly map QTNs for binary diseases and did not depend on estimation of genomic heritability and QC for genomic datasets. For each dataset with standard QC, we recorded the running times from input of genotypes and phenotype to output of mapping QTNs for all the methods. Table 2S shows that Optim-GRAMMAR reduced the computing time by dozens of times with the lowest memory footprint.

Discussion
Development of the GRAMMAR for GLMM association was an essential prerequisite for extending the Optim-GRAMMAR to rapidly optimize mixed model association analysis for binary traits. For genomic GLMM, however, no binary residuals could be produced because of the scale difference between binary phenotype and predictors. Thus, we considered the GBVs estimated in advance as a known predictor in genomic logit regression and then executed association tests for candidate markers. This ensured that GRAMMAR had the lowest computing complexity for association tests for binary traits among the existing GLMM-based association methods [17][18][19] . Because GRAMMAR for binary diseases produces high false negative rates for quantitative traits distributed normally, we optimized the genomic control for GRAMMAR by regulating downward genomic heritability to underestimate the GBVs with GBLUP equations for GLMM. Thus, optim-GRAMMAR solved the GBLUP equations and performed association tests with simple logit regression only for several iterations, thus improving the computational e ciency for genome-wide GLMM association analysis.
Several GLMM-based association methods such LTMLM, GMMAT, and SAIGE have simpli ed genomewide mixed model association analysis for binary traits to a certain extent, but they are more appropriate to handle the less complex populations such as human datasets 18,19 . Moreover, the heritability for binary diseases could not be robustly and precisely estimated using genomic markers 16 14,15 , which also limited the e cient application of these association methods. In contrast, because optim-GRAMMAR does not need to directly estimate genomic heritability, it can powerfully and robustly map QTNs for binary traits in complexly structured populations. Within the framework of Optim-GRAMMAR, further, joint analysis for the candidate quantitative trait nucleotides chosen by multiple testing signi cantly improved statistical power to detect QTNs with almost perfect genomic control.

Online Methods
Genomic logit regression Complex disease traits, as binary ones, usually follow binomial or Poisson distributions, so the generalized linear model (GLM) 1,2 is used to map QTLs controlling the traits. Assume that n individuals are recorded for phenotypic values and genotyped for m genetic markers. Distinguishing these markers from major and common alleles in a magnitude of effects of the markers on quantitative traits, we describe the relationship between all markers (predictors) and the mean of the exponential distribution family in the following logit regression: where µ denotes the expectations of phenotypic distribution, b is the systematic environment effect; the population structure (strati cation) which results in phenotypic differences among subpopulations is always considered here, except for sex, age, and some initial experimental conditions. a 1 is the large genetic effect of q markers on phenotype, a 2 is the minor or zero effect of the m-q markers on phenotype, and X, Z 1 , and Z 2 are the corresponding design matrices of b, a 1 , and a 2 , respectively.

GRAMMAR for binary disease traits
We de ne the GBVs as where y is a binary phenotype, and h 2 = σ 2 g σ 2 g + 1 is the unknown genomic heritability of liability with the residual variance of 1 to be assumed in GLMM, which can be estimated in advance using the REML for GLMM 3,5 .
Unlike the normally distributed quantitative traits, the residuals for binary traits cannot be directly obtained due to the difference in scale between the predictors and response variables. Within the framework of GRAMMAR, thus, we eliminated polygenic effects on the binary phenotype by regarding the estimated GBVs ĝ as a known predictor in the following GLM: ln μ 1 − μ = z SNP a SNP +ĝ 5 with a regression item z SNP a SNP of the SNP tested.
With the iteratively re-weighted least square method 1 , we obtained the maximum likelihood estimate for the SNP effect as:â The test statistic to infer the association of the SNP with binary traits is generally formulated by which is subject to the chi-squared distribution with the 1 degree of freedom.

Optimal Genomic control
In GRAMMAR for binary traits, replacement of polygenic effects excluding QTNs with GBVs de ates the test association statistics, which yields a high false negative rate. By regulating the downward genomic heritability, we can more accurately estimate the polygenic effects with the GBLUP Eq. (4). The polygenic heritability less than genomic heritability is determined by optimizing genomic control for association tests. Such an optimization for GRAMMAR can be summarized in the following steps:  Comparison in the ROC pro les between Optim-GRAMMAR and the four competing methods. The ROC pro les are plotted using the statistical powers to detect QTNs relative to the given series of Type I errors.
Here, the simulated phenotypes are controlled by 200 QTNs with the low, moderate and high heritabilities in human and maize. The ROC pro les for all simulated phenotypes are reported in Supplementary Figure  2S.