A Shortcut Approach for Large-scale Mixed Model Associations with Binary Traits

Generalized linear mixed models exhibit computationally intensive and biasness in mapping quantitative trait nucleotides for binary diseases. In genomic logit regression, we consider genomic breeding values estimated in advance as a known predictor, and then correct the deated association test statistics by using genomic control, thereby successfully extending GRAMMAR-Lambda to analyze binary diseases in a complex structured population. Because there is no need to estimate genomic heritability and genomic breeding values can be estimated by a small number of sampling markers, the generalized mixed-model association analysis has been extremely simplied to handle large-scale data. With almost perfect genomic control, joint analysis for the candidate quantitative trait nucleotides chosen by multiple testing offered a signicant improvement in statistical power.


Introduction
Complex diseases are generally thought to be the quantitative traits controlled by a number of loci each having a small effect 1,2 . Conventionally, logistic regression in generalized linear models (GLMs) 3,4 instead of linear regression models was used to map quantitative trait nucleotides (QTNs) for binary phenotypes. Although logistic regression can correct for xed-effect covariates [5][6][7] , it still leads to in ation of associated test statistics. Therefore, the generalized linear mixed-model (GLMM) 8 which considers random polygenic effects to increase the power to map QTNs for disease traits was introduced. However, the computation for genome-wide GLMM association is much more intensive than that for mixed-model association with either restricted maximum likelihood estimation (REML) 9 or Markov chain Monte Carlo (MCMC) iterations 10 . If using the maximum likelihood in estimation and approximations to avoid numerical integration, the GLMM produces a problem of serious bias induced by the approximations 11 , especially solutions tending toward positive/negative in nity.
In ascertained case-control studies in which the proportions of cases and controls are not a random sample from the population, GLMM provides biased estimates of the genomic heritability for disease traits 12 and therefore suffers a loss in the power to detect QTNs. Based on the calibrated genomic hereditability for case-control ascertainment, a Chi-squared score statistic for association tests was computed from the posterior mean liabilities under the liability-threshold model 13 . For simpli cation of GLMM-based association analysis, GMMAT 14 and SAIGE 15 separately extend the EMMAX 16 and BOLT-LMM 17 for normally distributed traits to binary diseases. Dividing the de ated test statistics from the GRAMMAR by genomic control, GRAMMAR-lambda was developed for extremely e cient genome-wide mixed-model association analysis (Not published yet). In this study, we extend the GRAMMAR-Lambda for quantitative traits normally distributed to handle binary diseases by regarding genomic breeding values (GBVs) estimated in advance as a known predictor in genomic logit regression and then calibrating the test statistics by genomic control. Because genomic heritability does not need be estimated and GBVs can be obtained by a small number of sampling markers, GLMM-based association analysis is extremely simpli ed. Moreover, joint analysis for the QTN candidates chosen by multiple testing signi cantly improves the statistical power to detect QTNs for binary diseases.

Statistical properties of GRAMMAR-lambda for binary diseases
For the two genomic datasets, phenotypes were simulated to be controlled by 40, 200, and 1000 QTNs at the low (0.2), moderate (0.5), and high (0.8) genomic heritability, respectively. The GRAMMAR-Lambda, a test at once is comparable with the four competing methods-GRAMMAR, GMMAT, LTMLM, and SAIGE. Association results are displayed selectively in Figure 1 for the Q-Q pro les and Figure 2 for ROC pro les ( Figure 1S and Figure 2S for details); the genomic controls are estimated in Table 1S. Under genomic control of exact 1, GRAMMAR-Lambda performs stably and with high statistical power, irrespective of how much the QTNs control the quantitative traits and how complex the population structures are. In contrast, GMMAT exhibits the same statistical power as GRAMMAR-Lambda, but with slightly low, even instable genomic control. For GRAMMAR, genomic controls are the lowest among GRAMMAR-Lambda and the four competing methods, the population structure is more complex, and the false negative rate is higher. Although LTMLM detected most QTNs for all simulated phenotypes in maize and SAIGE provided higher statistical power for those controlled by 1,000 QTNs at the genomic heritability of 0.2, they produce strong false positive errors. For humans, there were no distinct differences in statistical properties of GRAMMAR-Lambda and the four competing methods, although GRAMMAR produced a slight false negative error.
Furthermore, GRAMMAR-Lambda could jointly analyze multiple QTN candidates chosen from a test at once at a signi cance level of 0.05. For convenience of comparison, statistical powers of detecting QTNs are depicted together in Figure 1 selectively and in Supplementary Figure 1S in detail. Through backward multiple regression analysis, GRAMMAR-Lambda offered signi cant improved statistical power, but also kept almost perfect genomic control that was in nitely close to 1. Even at the highest false positive rates, LTMLM was inferior to GRAMMAR-Lambda with joint analysis in terms of statistical power.

Sensitivity to estimate genomic heritability
For normally distributed traits, GRAMMAR-Lambda does not need to estimate genomic heritability. To test whether this nding ts the binary diseases or not, we pre-specify the genomic heritability at 0.5 to analyze the simulated phenotypes controlled by 200 QTNs at different levels of heritabilities (see Figure   3S). Figure 3 compares the statistical behaviors of GRAMMAR-Lambda and the other methods by specifying and estimating genomic heritability. As shown in each plot, both QQ and ROC curves obtained by pre-specifying the genomic heritability almost overlapped with those obtained by estimating genomic heritability. Extensive simulations in optimizing GRAMMAR showed that the lower bound of the given heritability may be lower for binary diseases than that for normally distributed ones (data not shown).
This supports the fact that while implementing GRAMMAR-Lambda for binary diseases, genomic heritability is set to 0.5 by default or to the empirical heritability of traits, if available.

Calculation of GRMs with the sampling markers
When the genomic heritability is pre-speci ed to 0.5, we randomly took 3 K, 5 K, 10 K, 20 K, and 25 K SNPs from the entire genomic markers to analyze the simulated phenotypes controlled by the varied numbers of QTNs at the heritability 0.5 using all methods, except LTMLM which cannot sample SNPs to estimate heritability. Changes in genomic control at the varied sampling levels of SNPs are depicted in Figure 4 for GRAMMAR-Lambda, GRAMMAR, GMMAT, and SAIGE. No competing method can stably control the positive/negative false errors using less than 20 K sampling SNPs, and SAIGE for human phenotypes simulated. Speci cally, GMMAT gradually controls the positive false errors when the sampling markers increased; GRAMMAR seemed to reduce the negative false rate by sampling less markers, while SAIGE produced a serious false negative error in the complex maize population. In comparison, GRAMMAR-Lambda still retained high statistical power to detect QTNs through perfect genomic control, even by using less than 3000 sampling markers (see Figure 4S and Figure 5S).

Application of GRAMMAR-Lambda to WTCCC study
We were authorized to collect the data of 11,985 cases of six common diseases and 3,004 shared controls, genotyped at a total of 490,032 SNPs from the Wellcome Trust Case Control Consortium For the six common diseases, we implemented GRAMMAR-Lambda in two ways: to estimate the genomic heritability and GBVs together using all genomic markers and to estimate only GBVs by randomly sampling 5,000 SNPs with a given heritability of 0.5. The Q-Q and Manhattan pro les for the six common diseases are shown in Figure 6S and Figure 7S, which depict GRAMMAR-Lambda and the four competing methods used in simulations. We concluded that (1) under perfect genomic control, GRAMMAR-Lambda found the QTNs for each disease on each chromosome, and the numbers of the detected QTNs were not less than all the competing methods; and (2) in GRAMMAR-Lambda, joint association analyses detected more QTNs than a test at once. Compared with GRAMMAR-Lambda, GRAMMAR detected less QTNs with the lowest genomic control among all methods, while GMMAT yielded more SNPs whose -log(p) exceeded the Bonferroni corrected thresholds for CAD, T1D, T2D, and HT, but it behaved the largest genomic control. Additionally, LTMLM estimated the abnormal genomic heritabilities for CAD, BD, T2D, and HT, producing an unstable genomic control.
For better estimating the genomic heritability, furthermore, we conducted strict QC for each dataset, as performed previously in 12 . Despite this, the missing heritabilities could not be normally evaluated for BD and HT. As shown in Figure 8S and Figure 9S, all methods provided clear and comparable association results, except for GRAMMAR. Interestingly, both LTMLM and GMMAT seriously underestimated the genomic heritability for each disease after strict QC. GRAMMAR-lambda could extremely e ciently and robustly map QTNs for binary diseases and did not depend on the estimation of genomic heritability and QC for genomic datasets. For each dataset with standard QC, we recorded the running times from input of genotypes and phenotype to the output of mapping QTNs for all the methods. Table 2S shows that GRAMMAR-Lambda several times to dozens of times reduced the computing time by almost dozens of times with the lowest memory footprint.

Discussion
Although interpretable and predictable for discrete traits, the GLMM cannot be e ciently applied into GWAS for complex diseases because of intractable solutions and computation. Several GLMM-based association methods such as LTMLM, GMMAT, and SAIGE have simpli ed the genome-wide mixed-model association analysis for binary diseases to certain extent, but they are more likely to appropriately handle less complex populations like humans 14,15 . In this study, we successfully extended GRAMMAR-Lambda from normally distributed traits to binary diseases, extremely simplifying the generalized mixed-model association analysis. In complex structured populations, the extended GRAMMAR-Lambda can more e ciently and robustly map QTNs for binary diseases than the existing GLMM-based association methods. With almost perfect genomic control, joint analysis for the candidate QTNs chosen by multiple testing signi cantly improved the statistical power, under the framework of GRAMMAR-Lambda.
Prior to applying GRAMMAR-Lambda for binary diseases, GRAMMAR needs to be constructed to rapidly associate binary phenotypes with candidate markers. Within the framework of GLM, however, no binary residuals could be produced because of the difference in scale between the binary phenotype and predictors. Therefore, we took the GBVs estimated in advance as a known predictor in genomic logit regression and then executed association tests for candidate markers. Moreover, the heritability for binary diseases could not be robustly and precisely estimated using genomic markers 9,11,12 , which also limits the e cient application of the existing GLMM-based association methods. Inheriting the advantages for the normally distributed traits, GRAMMAR-Lambda extremely e ciently performed genome-wide GLMM association analysis in three ways: 1) the genomic heritability was not required to be estimated for binary diseases; 2) it used fewer sampling markers to calculate the GRM; and 3) the computing complexity of association tests was the lowest for binary phenotypes, as that of the PLINK 19 .
Generally, computing costs are attributed to building the GRM, estimating variance components or polygenic heritability, and computing association statistics in genome-wide mixed-model association analysis 20 . To date, no algorithm has been developed that can comprehensively reduce these three computational charges. For a genomic dataset containing m SNPs genotyped on n individuals, GRAMMAR-Lambda for binary diseases took only the computing complexity of O(mn 2 ) to build the relationship matrix and O(mn) for association tests. When analyzing a large-scale population, we solved the effects of m 0 of the sampled markers using ridge regression 21 with given heritability and then estimated GBVs as Z 0 a 0, which reduced the computing time to build the information matrix to , as in FaST-LMM-Select 22 . For the simulated 8 million SNPs on 400,000 individuals, GRAMMAR-Lambda required only 4.7hr to analyze single binary phenotype by sampling 5,000 SNPs to calculate GRM, while SAIGE did about 534 hr 15 . A user friendly GRL-Binary software was developed, which is freely available at https://github.com/RunKingProgram/ BinaryGRAMMAR-Lambda. Figure 1 Comparison in the Q-Q pro les between GRAMMAR-Lambda and the four competing methods. The simulated phenotypes are controlled by 200 QTNs with the low, moderate and high heritabilities in human and maize. The Q-Q pro les for all simulated phenotypes are reported in Supplementary Figure 1S. Comparison in the ROC pro les between GRAMMAR-Lambda and the four competing methods. The ROC pro les are plotted using the statistical powers to detect QTNs relative to the given series of Type I errors.

Declarations
Here, the simulated phenotypes are controlled by 200 QTNs with the low, moderate and high heritabilities in human and maize. The ROC pro les for all simulated phenotypes are reported in Supplementary Figure   2S. Sensitivity of statistical powers to the speci ed heritabilities for GRAMMAR-Lambda. Statistical powers are dynamically evaluated with the ROC pro les. The simulated phenotypes are controlled by 200 QTNs with the low, moderate and high heritabilities in human and maize.