A generalized linear mixed model association tool for biobank-scale data

Compared with linear mixed model-based genome-wide association (GWA) methods, generalized linear mixed model (GLMM)-based methods have better statistical properties when applied to binary traits but are computationally much slower. In the present study, leveraging efficient sparse matrix-based algorithms, we developed a GLMM-based GWA tool, fastGWA-GLMM, that is severalfold to orders of magnitude faster than the state-of-the-art tools when applied to the UK Biobank (UKB) data and scalable to cohorts with millions of individuals. We show by simulation that the fastGWA-GLMM test statistics of both common and rare variants are well calibrated under the null, even for traits with extreme case–control ratios. We applied fastGWA-GLMM to the UKB data of 456,348 individuals, 11,842,647 variants and 2,989 binary traits (full summary statistics available at http://fastgwa.info/ukbimpbin), and identified 259 rare variants associated with 75 traits, demonstrating the use of imputed genotype data in a large cohort to discover rare variants for binary complex traits. FastGWA-GLMM is a fast, scalable generalized linear mixed model method for genetic association testing for binary traits in large cohorts that is robust to variant frequency and case–control imbalance.

manipulation of full-dense n × n matrices (although not explicitly computed), with n being the sample size, which is both time-and resource-consuming.
In our previous work, we developed an LMM-based GWA tool, fastGWA, that is orders of magnitude faster than other LMM-based tools, mainly owing to the use of a sparse genomic relationship matrix (GRM) to capture pedigree relatedness among individuals 10 . However, when we applied fastGWA to the UKB data, we had to remove ~3 million variants with MAF ≤ 0.01 and ~1,000 binary traits with case-control ratios <1:99 to avoid the inflation in FPR mentioned above 10 . In the present study, we aimed to develop a GWA tool that is scalable to GWAS data with millions of individuals and applicable to both common and rare variants for all binary phenotypes, including those with highly unbalanced case-control ratios. To achieve this goal, we incorporated GLMM into the fastGWA framework and developed efficient sparse matrix-based algorithms for parameter estimation and association testing. We name the method fastGWA-GLMM (or fastGWA-B, to distinguish it from the original fastGWA method 10 ) and have implemented it in the GCTA software package 15 . We demonstrate by simulation that the test statistics from fastGWA-GLMM are not inflated for either common or rare variants, even if the case-control ratio is extremely low (for example, 0.1%). We then show by analyzing subsets of the UKB data that fastGWA-GLMM is severalfold to orders of magnitude faster than the state-of-the-art tools (for example, when n = 400,000, fastGWA-GLMM is 36.8-fold faster than SAIGE using 8 CPU (central processing unit) threads). We further demonstrate the scalability of fastGWA-GLMM to GWAS data beyond the size of the UKB, based on a simulated dataset with two million individuals. We have applied fastGWA-GLMM to the UKB data for 2,989 binary traits and made the full summary statistics publicly accessible at the fastGWA data portal (http://fastgwa.info/ukbimpbin). logit (μ) = xs β s + Xc β c + g where y is an n × 1 vector of binary phenotypes, μ is a vector of μ i = P(y i = 1|x si , X ci , g i ) with μ i being the probability of subject i being a case given the subject's genotype x si , covariates X ci and random genetic effect g i ; x s is a vector of genotype variables of a variant of interest with its effect β s ; X c is the incidence matrix of fixed-effect covariates (for example, sex, age and principal components (PCs)) with their corresponding coefficients β c ; g is a vector of effects that capture genetic and common environmental effects shared among related individuals, g ∼ N ( 0, π σ 2 g ) with π being the pedigree relationship matrix and σ 2 g being the corresponding variance component. In practice, if pedigree information is unavailable or incomplete, π can be replaced by the sparse GRM, that is, the GRM with all the small off-diagonal elements (for example, those <0.05) set to zero 10 .
The fastGWA-GLMM method comprises two steps: (1) the parameter estimation step: estimating σ 2 g , β c and the other parameters under the null model, that is, logit (μ) = Xc β c + g; and (2) the association test step: performing score test for each variant and, if necessary, applying saddle point approximation (SPA) to correct for potential inflation driven by case-control imbalance (Methods). In the parameter estimation step, utilizing the advantage of the sparse matrix setting, we have developed a grid search-based algorithm (named fastGWA-B-REML; see Methods) that is highly computationally efficient and robust in estimating the variance components, even for traits with an extremely unbalanced case-control ratio (for example, 0.1%). In the association test step, based on the estimates obtained from the step above, the score test statistic for each variant can be computed by the following equation: Tscore = x T s (y −μ) with var (Tscore) = x T s Pxs and W being a diagonal matrix, that is, W ii = μ i (1 − μ i ) (Methods). P is an n × n projection matrix, which is dense despite π being sparse (Methods). Therefore, to avoid the computational bottleneck due to matrix multiplication involving P, a GLMM version of the GRAMMAR-GAMMA approximation method 14 is implemented in fastGWA-GLMM (Methods).
To avoid an elevated FPR in GWAS discovery due to case-control imbalance, the SPA correction is applied to variants with χ 2 test statistics ≥4. To further improve computational efficiency, we have developed an approximate approach to account for covariates for variants with χ 2 test statistics <4 (Methods). An alternative version without using this approximation is also available, which is a fewfold less efficient depending on the number of covariates fitted (Methods). We have shown that the test statistics from the approximate approach were highly consistent with those from the exact approach ( Supplementary Fig. 1). In addition, to test the aggregate effect of multiple variants (for example, all the rare variants in or around a gene), we have implemented in GCTA two gene-based methods, the fastGWA-GLMM burden test (named fastGWA-BB) and the Aggregated Cauchy Association Test, based on variant-level P values (ACAT-V) 16

(Methods).
Runtime and resource requirements. We used the UKB data, consisting of 456,348 individuals of European ancestry and 11,842,647 variants (Methods), to evaluate the computational performance of fastGWA-GLMM in GCTA v.1.93.3 in comparison with SAIGE v.0.42.1. Note that the standard logistic regression (as implemented in PLINK2 v.2.00a2.3) was not included in the comparison because it is >100-fold slower than fastGWA-GLMM and not applicable to some of our simulation settings. After randomly sampling subgroups of individuals (ranging from 50,000 to 400,000) from the UKB, we performed a GWA in each subset of data using fastGWA-GLMM and SAIGE, respectively, on a computing platform with 80 GB of memory and 8 CPUs. The trait used for comparison is 'irritability' (case-control ratio = 0.39; UKB datafield: 1940). The genotype data were stored in BGEN v.1.3 format 17 . Each test was repeated five times for an average of runtime and peak memory usage. We observed a linear relationship of either the runtime or the memory usage of fastGWA-GLMM to sample size (n) (Fig. 1). For a GWA with n = 400,000, fastGWA-GLMM required only 4.9 h, which is 36.8-fold faster than SAIGE ( Fig. 1a and Table 2). Both tools showed very efficient memory usage (Supplementary Table 3), with fastGWA-GLMM showing a better scalability (Fig. 1b). During the revision of our manuscript, we have also included in the comparison a newer version of SAIGE (v.0.44.5), which was slower than SAIGE v.0.42.1 (Supplementary Tables 1 and 2). Hence, we report the main results based on SAIGE v.0.42.1.
To assess the scalability of fastGWA-GLMM to GWAS data beyond the size of the UKB, we generated a pseudo-GWAS dataset with 2 million individuals and 11,842,647 variants, by duplicating the UKB genotype data multiple times (Supplementary Note). We used a threshold-liability model to simulate a binary disease trait and ran fastGWA-GLMM with the full set (n = 2 million) or a subset of the data (n ranging from 200,000 to 1.8 million) (Supplementary Note). The result showed that the relationship between the runtime (or memory usage) of fastGWA-GLMM and n was still linear when n was >400,000 (Fig. 2 and Supplementary Tables 4 and 5), suggesting the scalability of fastGWA-GLMM to data with millions of individuals. For instance, when using 16 CPUs, fastGWA-GLMM was easily scalable to data with n = 1 million (runtime = 8.3 h and memory usage = 14.5 GB), and it took less than a day (runtime = 17.0 h and memory usage = 31.6 GB) to run a GWA with n = 2 million.
While our manuscript was under revision, a machine learning method (called REGENIE) that implements whole-genome regression of multiple traits was published 18 . REGENIE and fastGWA-GLMM showed highly similar performances in both simulation study (Extended Data Figs. [2][3][4] and real data analysis ( Supplementary Figs. 2 and 3). We benchmarked the computational performance of REGENIE (v.2.0.2) using the same data (n = 400,000) and on the same platform as we used to benchmark fastGWA-GLMM and SAIGE in three scenarios: a single trait, 8 traits (the set of traits included in our real data analysis) and 50 traits (the number used in the REGENIE paper 18 ). The runtime of REGENIE in the single trait analysis was 27.5 h, 5.6-fold slower than that of fastGWA-GLMM (4.9 h) (Supplementary Table 6). When analyzing 8 traits jointly, the runtime of REGENIE was 97.2 h, corresponding to 12.2 h per trait, which was still 2.5-fold slower than fastGWA-GLMM (Supplementary Table 6 Table 6). Hence, in our benchmark analysis, the runtime of fastGWA-GLMM per trait was 2.5-to 5.6-fold faster than that of REGENIE, depending on the number of traits analyzed jointly in REGENIE. Moreover, grouping phenotypes for a joint analysis increases the time per run, which becomes more challenging when applied to larger datasets, as demonstrated in the scenario below. When we applied REGENIE to the pseudo-cohort of 2 million individuals, on average fastGWA-GLMM was at least 4.0-fold faster than REGENIE per trait, and REGENIE could not finish the job for 50 traits because it exceeded the 4 week time limit (Supplementary Table 7). It is of note that, when using a single CPU thread, the association test step of fastGWA-GLMM was 13.0-fold faster than that of REGENIE and 7.2-fold faster than that of SAIGE in the analysis with n = 400,000 (Supplementary Table 8).
FPR and statistical power. To quantify the statistical performance of fastGWA-GLMM in comparison with other tools, including SAIGE 14 and PLINK2 (logistic regression using all individuals or unrelated individuals, denoted as LR-All and LR-unRel, respectively) 7 , we generated a sample of 100,000 simulated individuals with substantial population stratification and relatedness from a subset of the real UKB genotype data (Methods). Based on the simulated genotype data, we randomly sampled a number of causal variants from all variants on the odd chromosomes to simulate phenotypes, leaving the variants on the even chromosomes as the null variants to quantify the type 1 error rate. We also introduced common environmental effects (that is, nongenetic effects shared among close relatives) and population stratification effects to the phenotype (Methods). Finally, using the simulated data, we quantified the FPR (that is, the proportion of null variants with P values less than a threshold) and statistical power (measured by the mean χ 2 statistic at the causal variants) for different association methods.
When the disease prevalence was >0.05, the FPRs of the null variants at 5 different P value thresholds (α = 0.05, 0.005, 5 × 10 −4 , 5 × 10 −5 and 5 × 10 −6 ) were largely consistent with the expected values for fastGWA-GLMM, SAIGE and LR-unRel, but inflated for LR-All, because relatedness was not accounted for in LR-All ( Fig. 3 and Extended Data Fig. 5). When the prevalence was ≤0.01, both LR-unRel and LR-All showed inflated FPRs, whereas such inflation was not observed for SAIGE and fastGWA-GLMM. The FPR of  SAIGE was slightly more deflated than that of fastGWA-GLMM in all the simulation scenarios ( Fig. 3 and Extended Data Fig. 5), probably owing to proximal contamination because of the use a dense GRM in SAIGE (see below for more discussion). In particular, in the scenario with prevalence = 0.005, the FPRs of SAIGE were more deflated than those of all the other methods because the parameter estimation process of SAIGE failed to converge in ~25% of the simulation replicates. We partitioned all the null variants into two groups (common and rare variants) based on an MAF threshold of 0.01 and evaluated the FPRs of the two groups separately in each simulation scenario. The FPRs for rare variants (MAF < 0.01) from LR-unRel and LR-All were substantially inflated in the scenarios with low prevalence levels, whereas those from fastGWA-GLMM remained consistent with the expected values for both common and rare variants regardless of the prevalence level (Supplementary Figs. 4 and 5). SAIGE showed similar performance to fastGWA-GLMM, except that it showed more deflated FPRs than all the other methods when prevalence = 0.005, due to its convergence issue as described above.
Next, we quantified statistical power of the methods by the mean χ 2 statistic at the causal variants (χ 2 causal ). The χ 2 causal was similar between fastGWA-GLMM and SAIGE, although the χ 2 causal from fastGWA-GLMM was slightly higher than that from SAIGE for common variants ( Supplementary Fig. 6). The χ 2 causal from LR-All or LR-unRel was noninformative in this case because it suffered from inflation due to relatedness and/or case-control imbalance. We then used the area under the curve (AUC) to compare power between the methods at the same FPR level (α = 1, 0.05, 5 × 10 −3 , 5 × 10 −4 or 5 × 10 −5 ; see Methods). Although LR-unRel tended to show lower AUCs because of its smaller sample size, SAIGE, fastGWA-GLMM and LR-All showed similar AUCs in almost all the scenarios (Supplementary Figs. 7 and 8), suggesting similar levels of power between the three methods when evaluated at the same FPR level. In addition, the test statistics of fastGWA-GLMM remained well calibrated when cases were oversampled (Supplementary Figs. 9-11). When pedigree information was fully available, fastGWA-GLMM using a pedigree relationship matrix performed almost equally well to that using the sparse GRM (Supplementary Note and Extended Data Figs. 6 and 7).
We have also used the simulation to benchmark gene-based test methods including fastGWA-BB and ACAT-V in GCTA and REGENIE-Burden (note that we did not manage to run SAIGE-Burden 19 successfully). All the three methods showed well-calibrated FPRs under the null model (Extended Data Fig.  8). To quantify the power under the alternative model, we varied the proportion of variants being causal in a gene (5%, 20% or 50%) and the directions of variant effects (random versus consistent). When the directions of the effects of the rare alleles were simulated at random, ACAT-V was always more powerful than fastGWA-BB and REGENIE-Burden, regardless of the proportion of variants being causal (Extended Data Figs. 9a-c and 10a-c). In the scenario where the effects of all the rare alleles in a gene were in a consistent direction, ACAT-V was still more powerful than fastGWA-BB and REGENIE-Burden when the proportion of variants being causal was relatively small (that is, 5%), but had lower power than fastGWA-BB and REGENIE-Burden when the proportion of variants being causal was large (that is, 20% and 50%) (Extended Data Figs. 9d-f and 10d-f). In all the simulation scenarios, we observed an extremely strong correlation (r > 0.98) between the test statistics of fastGWA-BB and REGENIE-Burden (Supplementary Table 9).
Applying fastGWA-GLMM to 2,989 binary traits in the UKB. We have applied fastGWA-GLMM to real data analysis of 2,989 binary phenotypes in the UKB with 456,348 individuals of European ancestry and 11,842,647 imputed variants. The phenotypes were generated using the pipeline from the Neale Lab (http://www.nealelab. is/uk-biobank) or our in-house pipeline based on the International Classification of Disease (ICD) to Phecode maps from Wu et al. 20 (Methods). All the summary statistics are available for visualization and downloading through the fastGWA data portal (http://fastgwa. info/ukbimpbin) 10 . To benchmark fastGWA-GLMM against SAIGE and PLINK2 LR-unRel (note: n = 348,456 for LR-unRel), we selected, from the 2,989 traits, 8 representative phenotypes, with prevalence ranging from 0.0008 to 0.45 (Supplementary Table 10 and Supplementary Figs. 2 and 3). After clumping, the number of quasi-independent signals from fastGWA-GLMM was larger than that from either SAIGE or LR-unRel (Supplementary Table 11). The signals missed by SAIGE were probably due to proximal contamination because most of them could be identified by SAIGE-LOCO, that is, SAIGE with the leave-one-chromosome-out (LOCO) option ( Supplementary Fig. 12). As for case-control imbalance, the results from LR-unRel started to exhibit inflation for traits with prevalence <0.01 (see the third panels of Supplementary Fig. 2e-h), consistent with our simulation results, and the inflation was more prominent for the rare variants (see the third panels of Supplementary  Fig. 3e-h). Meanwhile, the results from fastGWA-GLMM and SAIGE remained robust for traits with low prevalence. Among all the 2,989 traits analyzed, we identified 326 pairs of quasi-independent associations between 259 rare variants (MAF < 0.01 and P ≤ 5 × 10 −9 ) and 75 traits (Supplementary Table 12; see Methods). Of the 259 rare variants, 37 are located in either the exonic regions or the 3′or 5′-UTRs (Supplementary Table 12), highlighting the enrichment of rare variants in the coding and UTR regions (enrichment P = 9.6 × 10 −5 ; Supplementary Note).
We have applied ACAT-V and fastGWA-BB to the eight representative traits. The results were in line with our observations from simulations that ACAT-V and fastGWA-BB performed differently under different patterns of genetic architecture (Supplementary Fig. 13 and Supplementary Table 13). We identified a total of 22 genes for 3 traits, with 4 genes in common across the methods, IL4R, PBX2, NXPH4 and STAC3, all associated with asthma. IL4R was a well-known gene related to asthma 21,22 . PBX2 was also a previously established asthma-associated gene 23 . No prior evidence has been found for NXPH4 and STAC3.
We have also benchmarked the performance of fastGWA-GLMM and SAIGE in disease risk prediction by a fivefold crossvalidation using two polygenic risk score methods: the conventional pruning + thresholding (P + T) method and the more recently developed SBayesR method 24 . Overall, across five UKB traits, there was a difference in prediction accuracy between the polygenic risk score methods, with SBayesR performing consistently better than P + T, in line with prior work 24,25 , but the difference between SAIGE and fastGWA-GLMM was negligible (Supplementary Table 14).

Discussion
In the present study, we have developed a highly computationally efficient association method, fastGWA-GLMM, for GWA analyses of binary phenotypes in large cohorts such as the UKB. Tested in a dataset with 400,000 individuals and 11,842,647 variants, fastGWA-GLMM was severalfold to orders of magnitude more efficient than the existing methods, depending on sample size, the number of traits analyzed and the number of CPUs used for each analysis job. We also demonstrated the scalability of fastGWA-GLMM to GWAS data with two million individuals. The implementation of the GLMM framework allows users to retain the maximum number of individuals in a GWA analysis in the presence of relatedness, and the incorporation of SPA correction properly calibrates the test statistics for traits with extreme case-control ratios. The application of fastGWA-GLMM to 2,989 binary traits in the UKB further demonstrated its utility and efficiency.
The major advantage of fastGWA-GLMM over LR-unRel is that it does not need to remove related individuals from the study, because the relatedness can be well accounted for by a pedigree relatedness matrix or a sparse GRM. Taking the real data application in the UKB as an example, fastGWA-GLMM was able to include all 456,348 participants in the association analysis, whereas LR-unRel could utilize only information from 348,456 unrelated participants. As most of the large population-based cohorts rely on an assessment center-based recruitment strategy, the proportion of relatives in the cohorts tends to be high and will keep increasing in the future 1 . In such cases, it is crucial to avoid removing data of related individuals. Another advantage of fastGWA-GLMM over LR-unRel is its efficiency. As with many other GLMM-based methods, fastGWA-GLMM uses a score statistic for association test, which is relatively easy to compute (Methods). In contrast, LR-unRel, such as that implemented in PLINK2, uses Wald's test based on an iteratively reweighted least squares method, which requires repeated solution of the full model for each variant, and hence is much slower than the score test, especially for analyses with covariates.
When applied to binary traits, the advantages of fastGWA-GLMM over LMM-based methods (for example, BOLT-LMM 9 and fast-GWA 10 ) can be summarized into two aspects. The first is the better interpretability of the effect sizes, because we can directly use natural logarithm to convert the β s from fastGWA-GLMM into an odds ratio (Supplementary Note). However, such transformation in LMM-based methods is indirect and requires sophisticated approximations 26 . The second aspect is the better-controlled FPR of fastGWA-GLMM by the SPA correction. As the SPA correction is not applicable when applying LMM-based methods to binary traits 27 , a common strategy is to exclude traits with low case-control ratios (for example, ≤1:99) and variants with low MAFs (for example, <0.01) 9,10 , resulting in substantial loss of valuable information. An alternative strategy is to down-sample controls, which performed reasonably well for common variants (Supplementary Fig. 14). For rare variants, however, although this strategy could reduce the inflation in the LMM test statistics, the remaining inflation was sufficiently large to generate false-positive associations ( Supplementary  Fig. 15). In comparison, by using fastGWA-GLMM, we obtained well-calibrated summary statistics for all 3,821,959 rare variants, among which we identified hundreds of variants associated with the traits at a very stringent significance level, including known associations (Supplementary Table 12). For example, we identified a rare missense variant in HOXB13, rs138213197, strongly associated with prostate cancer, and this association has been reported in previous studies [28][29][30] .
SAIGE is a GLMM-based method that uses a dense GRM. Apart from the GRM setting, there are three other major differences between fastGWA-GLMM and SAIGE. First, fastGWA-GLMM uses a grid search-based algorithm, fastGWA-B-REML, to estimate the variance components (Methods), which is more robust and often orders of magnitude more efficient than the average information REML algorithm used in SAIGE, even for traits with extreme case-control ratios. In our simulation with prevalence = 0.005, the variance estimation step of SAIGE failed to converge for 26 of 100 simulation replicates. Second, SAIGE may suffer from proximal contamination because of the use of a dense GRM 31 . This is demonstrated in our real data analysis that there were several signals missed by SAIGE for three traits with relatively high case-control ratios, that is, mood swings, irritability and asthma (Supplementary Figs. 12a-c), but most of them can be restored by SAIGE-LOCO ( Supplementary Fig. 12a-c and Supplementary Table 11). Third, instead of using covariate-adjusted genotype data to calculate a score test statistic for every variant, fastGWA-GLMM first uses unadjusted (but mean-centered) genotype data to calculate an approximate score test statistic, and then re-calculate the exact test statistic using the covariate-adjusted genotype data for variants with χ 2 test statistics ≥4 (note: the SPA correction is also applied when this threshold is met; see Methods). This strategy allows fastGWA-GLMM to omit the computation of matrix multiplication between the covariate matrix and ~95% of the genotype vectors. We have confirmed that the difference of test statistics between the approximate covariate-adjustment approach and the exact approach is negligible (Supplementary Fig. 1). Only variants with χ 2 test statistics <4 might suffer from slight deflation, which does affect the power of detecting association at a genome-wide significance level. This strategy is particularly useful when the number of covariates is large (for example, >20). In our software tool, there is an option to allow users to switch off this approximation and force all the statistics to be calculated by the covariate-adjusted genotypes, which will cause a loss of computational efficiency by a fewfold, depending on the number of covariates fitted.
There are a few caveats when applying fastGWA-GLMM in practice. First, if pedigree data are not usable, a sparse GRM needs to be pre-computed from the SNP data. A very efficient parallelized algorithm has been implemented in GCTA to compute the sparse GRM 10 . As the sparse GRM setting has already been adopted by fastGWA 10 , once generated, the same sparse GRM of a cohort can be used for GWA analyses of all the quantitative and binary phenotypes. Therefore, the average computational cost per trait is minimal. Second, the σ 2 g estimated from fastGWA-B-REML cannot be interpreted as genetic variance or heritability. This is mainly due to the use of the penalized quasi-likelihood and Laplace's method 14 . However, from our simulations and real data applications, it did not affect the statistical performance of the association test of fastGWA-GLMM. Third, as noted in prior work 14 and also demonstrated in the present study (for example, Extended Data Fig. 5), the SPA correction can lead to deflation in test statistics under the null. However, we believe that the SPA correction is still necessary, because otherwise the FPR will be highly inflated for rare phenotypes (Supplementary Figs. 16  and 17). Note that the FPR of SAIGE without SPA was still deflated even when the case-control ratio was not extreme, possibly because of the proximal contamination noted above. Fourth, our previous work shows that the σ 2 g estimated based on a sparse GRM is a better quantity to control for relatedness than that based on a dense GRM in quantitative trait analysis 10 , which is not, however, the case for binary traits. The FPR of either fastGWA-GLMM or SAIGE was well controlled under the null in the presence of relatedness. Fifth, the inclusion of rare variants in the association analysis increases the multiple testing burden. Hence, in the present study, following the guidelines from previous studies 32, 33 , we used P ≤ 5 × 10 −9 instead of 5 × 10 −8 as the genome-wide significance threshold.
Despite these caveats, fastGWA-GLMM is a highly efficient GLMM-based method that is applicable to GWA analyses of many binary phenotypes in biobank-scale data. The extensive simulations under different parameter settings and the real data analyses of nearly 3,000 UKB traits demonstrate its statistical robustness and computational efficiency. We believe that fastGWA-GLMM is a very useful tool for current and upcoming large-scale data, and the summary statistics released from the present study will be useful for future efforts to gain insights into the genetic basis of many health-related outcomes.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/ s41588-021-00954-4.

Ethics approval. The present study was approved by the University of Queensland Human Research Ethics Committee B (2011001173) and the Ethics Committee of Westlake University (20200722YJ001).
Estimating the variance components. As described in Results, the fastGWA-GLMM model can be written as logit (μ) = xs β s + Xc β c + g. The logit function, logit (μ) = log( μ 1−μ ), is a commonly used link function in GLMM that links the expectation of the dependent binary variable y to a linear predictor that involves the independent variables. Solving this full model repeatedly for each variant is computationally unfeasible in large samples, so a common strategy is to first solve σ 2 g as well as the other essential components under the null model, that is, logit (μ) = Xc β c + g, and then calculate the score statistic for each variant based on the estimates from the null model. This strategy has been adopted by many existing LMM and GLMM methods 14,31,[34][35][36][37][38][39][40][41] .
Following GMMAT 41 and SAIGE 14 , the log(quasi-likelihood) of the null model is: is the quasi-likelihood (ql) for the ith individual given the random effect g, and a i is a known constant that will be omitted during the derivation. Following the derivations in Chen et al. 41 can be written as: An iterative approach is required to compute the quasi-likelihood and estimate the parameters. A commonly used algorithm is the average information REML (AI-REML) 42 , which has been adopted by both GMMAT 41 and SAIGE 14 . Leveraging the sparsity of π, we propose a grid search-based REML approach (called fastGWA-B-REML) with a special optimizer (see in the following) that can directly maximize ql ( β c , σ 2 g ) and return a maximum likelihood estimate of σ 2 g , which is often orders of magnitude faster and more robust than AI-REML, especially for traits with extremely unbalanced case-control ratios. A brief summary of fastGWA-B-REML is shown as follows. Let subscript i denote the iteration step with i starting from 0: (1) β c(i) is estimated from a standard logistic regression (that is, logit (μ) = Xc β c ), which is used as the starting value for β c (2) σ 2 g(i) and ĝ i are set to 0 given β c(i) and ĝ i (see details in next section) ≤ a threshold (by default 5 × 10 −5 ).
We use the sparse matrix Cholesky's decomposition algorithm implemented in the Eigen C++ library (http://eigen.tuxfamily.org) to compute the terms involving |V| or V −1 in a very efficient manner.
The grid search-based fastGWA-B-REML optimizer. As mentioned above, we have developed a grid search-based optimizer to estimate σ 2 g , which is more robust and often orders of magnitude faster than AI-REML. In the ith iteration of the parameter estimation step of the fastGWA-B-REML method, the grid search-based optimizer runs as follows: given each value of σ 2 g , and select the flanking grids of the σ 2 g value that produces the maximum quasi-likelihood to form a finer-scale searching interval (denoted by [σ 2 low,0 ,σ 2 up,0 ]) for the fine-tuning step below. ). This fine-tuning step is repeated four times, and σ 2 max = (σ 2 low,5 +σ 2 up,5 )/2 is returned as an estimate of σ 2 g(i) for the ith iteration of fastGWA-B-REML. The l (i) is the lower bound of the grid, determined by the variance components from the previous iteration. If we let σ 2 , then l (i) is set to 0 when i ≤ 3 or σ 2 g(i) ≤ 0.1 or set to 0.8σ 2 g(i) when i > 3 and σ 2 g(i) > 0.1. Similarly, the u (i) is the upper bound of the grid, which is set to Ỹ 2 i when i = 1, to 10σ 2 g(i−1) when i = 2 and 3, or to 1.2σ 2 g(i) when i ≥ 3. The k is the number of steps in the grid, which is set to 800 when i = 1, to 200 when i = 2 and 3, or to 50 when i ≥ 3. We apply the settings above to determine the boundaries and grid steps. We know from simulations that three iterations are sufficient to identify a reasonable interval for σ 2 g . We do not adopt the commonly used conventional optimizers (for example, the golden-section search) because the domain of ql does not always cover the whole range of [0,Ỹ 2 i ]. Therefore, it is difficult to choose an appropriate searching interval [l (i) , u (i) ] for the conventional optimizers, which would lead to a local optimum. On the other hand, we do not adopt AI-REML as implemented in SAIGE 14 because AI-REML often fails to converge when the case-control ratio is low. For example, in our simulations, SAIGE did not converge in 26 out of 100 simulation replicates under the simulation scenario with prevalence equal to 0.005.
Computing the score test statistic by the GRAMMAR-GAMMA approximation. As mentioned above, the fastGWA-GLMM method comprises two steps: the estimation step and the association test step. After obtaining all the necessary estimates from the null model in the parameter estimation step, we perform the association test for each variant using the score test: We know from prior work 14 The score test P value can be computed based on . Tscore can be computed efficiently because it only involves vector multiplication and (y −μ) needs to be calculated only once. However, var (Tscore) is difficult to obtain because P is an n × n dense matrix, and x T s Pxs needs to be evaluated repeatedly for every variant. The GRAMMAR-GAMMA approximation is a method to tackle this problem in LMM-based GWA analysis for quantitative traits 39 , and has been extended to cope with GLMMs in SAIGE 14 . In brief, for a random variant, its gamma ratio is approximately constant regardless of its genotypes. The denominator is easy to compute because W is an n × n diagonal matrix. Therefore, by randomly selecting m variants (the default m value is 200 in fastGWA-GLMM), we first estimate the mean of the gamma ratio by γ = 1 m ∑xT s Pxs x T s Wxs and then calculate var(Tscore) ≈γx T s Wxs for all the variants 14 . This strategy avoids computing var (Tscore) =x T s Pxs repeatedly for each variant and reduces the computational complexity of the association test step to nearly O(mn). The runtime can be further reduced by an approximate covariate adjustment approach, especially when the number of covariates is large (for example, c > 20). The full derivation of the approximate covariate adjustment approach has been described in the Supplementary Note. We observed from real data applications that the difference between the test statistics of the approximate and exact methods was very small (Supplementary Fig. 1). In our software tool, users can mute the approximation method and, in that case, it is a fewfold slower than the default version, depending on the size of c.
Correcting for genomic inflation by SPA. After obtaining the score test statistics, we calibrate the fastGWA-GLMM P values by SPA 43,44 to avoid potential inflation driven by case-control imbalance. The SPA method has recently been improved to cope with GWAS data (called fastSPA) 14,27 . FastSPA was originally implemented in R 27 . To improve the computational efficiency, we implemented fastSPA by highly optimized C++ codes in fastGWA-GLMM. By default, fastGWA-GLMM applies the fastSPA correction to variants with χ 2 test statistics ≥4 (Supplementary Note).
Gene-based tests. We have implemented in GCTA two gene-based test methods: the fastGWA-GLMM burden test (named fastGWA-BB) and the ACAT-V 16 . FastGWA-BB is an extension of the single-variant test in fastGWA-GLMM to assess the association of a weighted rare allele count of all the tested rare variants in or around a gene with the phenotype, conditioning on the sparse GRM (Supplementary Note). The variants weighting method follows that in SAIGE-Burden 19 . As with the single-variant association tests, the SPA correction is applied to adjust the gene-based test P values for case-control imbalance. ACAT-V is a general and flexible gene-based test method. It requires only summary statistics and does not rely on a specific assumption about the direction or the distribution of the effects of the rare alleles. The tail of the null distribution of the ACAT-V test statistic can be well approximated by a Cauchy distribution regardless of the linkage disequilibrium (LD) structure of the variants 16 . Hence, LD information is not needed (Supplementary Note). We have implemented an efficient version of ACAT-V in GCTA by C/C++ code.
The UKB data. The UKB is a large cohort study consisting of approximately 500,000 participants aged between 40 and 69 years at recruitment, with extensive phenotypic records 1 . All the UKB participants signed written informed consent, and the protocols were approved by the National Research Ethics Service Committee. In the present study, 456,348 UKB participants of European ancestry were selected for simulation and real data analyses. Genetic data were genotyped by two different arrays: the Applied Biosystems UK Biobank Axiom Array and the Applied Biosystems UK BiLEVE Axiom Array 1 . SNP imputation was conducted by the UKB analysis team using whole-genome sequence data from the Haplotype Reference Consortium 45 and the UK10K project 46 as the reference panels. The imputed data were filtered with standard quality control (QC) criteria in PLINK2 (ref. 7 ), for example, MAF ≥ 0.0001, Hardy-Weinberg equilibrium test P ≥ 10 −6 , genotyping rate ≥0.9 and imputation info score ≥0.8, resulting in 11,842,647 imputed variants (8,020,670 common and 3,821,977 rare) after QC. Note: we used 588,927 genotyped variants for the simulation study and 11,842,647 imputed variants for real data analyses 1 .

Simulation.
To assess the statistical performance of fastGWA-GLMM, we simulated 100,000 artificial individuals with a moderate proportion of relatives (10% of all samples) and substantial population stratification (to mimic two different ancestry backgrounds). A 'mosaic-chromosome' scheme modified from Loh et al. 40 was used to generate the artificial individuals (Supplementary Note). The difference of the current simulation process from that of Jiang et al. 10 was the inclusion of 32,658 genotyped rare variants from the UKB, with MAF ranging from 0.01 to 0.0001.
A set of different parameters was used to simulate a binary phenotype. We started from simulating a quantitative phenotype for the 100,000 simulated individuals based on the model below: xcomibcomi is the sum of the genetic effects of m 1 common causal variants (MAF ≥ 0.01) with xcomi being a vector of variant genotypes and xrareibrarei is the sum of the genetic effects of m 2 rare causal variants (MAF < 0.01) with xrarei being a vector of variant genotypes and brarei ∼ N(0, 1); z is a vector consisting of 0 (British) and 1 (Irish) to indicate ancestry with b p being the mean difference in phenotype between the two ancestry groups; e C is a vector of common environmental effects shared among individuals in the same families with e C ∼ N(0, Iσ 2 C ); and e is a vector of residuals with e ∼ N(0, Iσ 2 e ). The causal variants (m 1 = 10,000 and m 2 = 1,000) were randomly sampled from variants on the odd chromosomes, so the variants on the even chromosomes could be treated as the null variants to quantify type 1 error rate. We varied the variance of the common environmental effects in different simulation scenarios including the following (see Supplementary Note for detailed description of the parameter settings): (1) no common environmental effects (denoted by 'noEnv'); (2) common environmental effects explaining 10% of V p among the first-and second-degree relatives (denoted by 'comEnv').
After obtaining the quantitative phenotypic value for each individual, we dichotomized the phenotype given seven sample prevalence rates (that is, 0.3, 0.2, 0.1, 0.05, 0.01, 0.005 or 0.001) to convert it to a binary phenotype. Each simulation was repeated 100 times.
Assessing the FPR and statistical power. Four different methods, SAIGE, LR-All, LR-unRel and fastGWA-GLMM, were used to conduct GWA analyses of the simulated data. The top ten PCs computed from a set of LD-pruned variants (MAF ≥ 0.01, window size = 1 Mb, step size = 50 variants and LD r 2 threshold = 0.05) using flashPCA2 (ref. 47 ) were included in the association analysis as fixed covariates. For fastGWA-GLMM, 538,752 common variants with MAF ≥ 01 were used to compute the sparse GRM, whereas, for SAIGE, as recommended in the software documentation, a set of 78,295 LD-pruned common variants were used as 'ModelSNPs' for the estimation of the additive genetic variance (Supplementary Note). After performing GWA analyses of the simulated data, we quantified the FPR using the null variants on the even chromosomes and the power using the mean χ 2 of the causal variants for each method in each simulation scenario. We additionally used the AUC at different significance levels to compare the power between methods when evaluated at the same FPR level (Supplementary Note). Moreover, we also measured the statistical performance of each method for common (MAF ≥ 0.01) and rare (MAF < 0.01) variants separately.
Real data analyses. We used fastGWA-GLMM to perform GWA analyses of 2,989 binary traits in the UKB. Participants with imputed SNP data and labeled as European ancestry (UKB datafield 1001) were included in the analyses (n = 456,348 and m = 11,842,647). Of all the traits, 2,154 were generated based on the QC pipeline from the Neale Lab (https://github.com/Nealelab/UK_Biobank_GWAS) using the PHESANT package 48 , which were either originally dichotomous or transformed from multi-categorical traits. The other traits were generated from the ICD, 10th edn (ICD-10) 49 records from the UKB. The original ICD-10 records provided by the UKB were unsorted text data (UKB datafield 41202), which were incompatible with the PHESANT package. Therefore, we first extracted every unique ICD-10 code for each individual, and then grouped the ICD-10 codes into different Phecodes based on the ICD-10 to Phecode v.1.2 map 20 . Any individual not labeled with a particular Phecode was treated as a control for that Phecode. We did not remove individuals with relevant diseases from the control group to avoid selection bias 50 . Eventually, 835 Phecode traits were retained for further analysis. We removed traits with n cases < 100 or n total < 5,000 and retained 2,989 traits in total. We fitted age, age 2 , sex, age × sex, age 2 × sex and the top 20 PCs provided by the UKB as covariates in the GWA analysis (note: only age, age 2 and the top 20 PCs were fitted for the sex-specific traits). We also applied SAIGE to all the 456,348 individuals and PLINK2 logistic regression to 348,501 unrelated individuals for 8 binary phenotypes selected from the UKB for comparison with fastGWA-GLMM. The same covariates were fitted, and details of the parameter settings of SAIGE and PLINK2 are described in the Supplementary Note. Clumping analyses were performed using the GWAS results from fastGWA-GLMM, SAIGE and PLINK2, respectively (LD-clumping parameters used: P value threshold = 5 × 10 −9 , window size = 5 Mb and LD r 2 threshold = 0.01) for each of the 8 phenotypes.

Statistics and reproducibility.
In all the association analyses, we used a one-sided χ 2 d.f.=1 statistic to test against the null hypothesis of no association (that is, H0 : Tscore = 0). Given an effective sample size of >350,000, we have >99.8% power to detect a genetic variant explaining only 0.02% of the phenotypic variance at P < 5 × 10 −8 . We excluded UKB participants of non-European ancestries because the sample sizes were insufficient to compare the methods' computing performance or identify associated variants for most traits. Standard QC criteria were applied to clean genetic variants to avoid the inclusion of low-quality variants in the association analyses. We did not use any study design that required randomization or blinding. The GCTA software source code and the analysis code to produce the main results presented in the present paper are available at Zenodo 51,52 . Fig. 2 | FPR for sAiGe, fastGWA-GLMM and ReGeNie quantified using the null common variants in simulations. Three methods, SAIGe, fastGWA-GLMM, and reGeNIe, are compared. The y-axis represents the FPr computed from the null common variants (that is, all the common variants on the even chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence = ncase/(ncase + n control )). FPr is evaluated at five different alpha levels (α=0.05, 0.005, 5×10 −4 , 5×10 −5 , and 5×10 −6 ), as shown in panels from a) to e), respectively. The dashed lines indicate the expected FPrs (that is, the alpha levels). each boxplot represents the distribution of FPr across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQr), whiskers indicate data up to 1.5 times the IQr, and outliers are shown as separate dots. In all the analyses, we used a one-sided χ 2 d.f.=1 statistic to test against the null hypothesis of no association. Fig. 3 | FPR for sAiGe, fastGWA-GLMM and ReGeNie quantified using the rare null variants in simulations. Three methods, SAIGe, fastGWA-GLMM, and reGeNIe, are compared. The y-axis represents the FPr computed from the null rare variants (that is, all the rare variants on the even chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence = ncase/(ncase + n control )). FPr is evaluated at five different alpha levels (α=0.05, 0.005, 5×10 −4 , 5×10 −5 , and 5×10 −6 ), as shown in panels from a) to e), respectively. The dashed lines indicate the expected FPrs (that is, the alpha levels). each boxplot represents the distribution of FPr across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQr), whiskers indicate data up to 1.5 times the IQr, and outliers are shown as separate dots. In all the analyses, we used a one-sided χ 2 d.f.=1 statistic to test against the null hypothesis of no association. Fig. 4 | comparison of power (as measured by the mean χ2 value of the causal variants) between sAiGe, fastGWA-GLMM and ReGeNie. The y-axis represents the mean χ 2 value of the causal variants (10,000 common and 1,000 rare causal variants on the odd chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence = ncase/(ncase + n control )). Apart from being evaluated for the 11,000 variants altogether in panel (a), the mean χ 2 value is also evaluated for common (MAF ≥ 0.01) and rare (MAF < 0.01) causal variants separately, as shown in panels b) and c), respectively. each boxplot represents the distribution of mean χ 2 across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQr), whiskers indicate data up to 1.5 times the IQr, and outliers are shown as separate dots. In all the analyses, we used a one-sided χ 2 d.f.=1 statistic to test against the null hypothesis of no association. Fig. 5 | FPR for fastGWA-GLMM and other methods quantified using all the null variants in simulations. The y-axis represents the FPr computed from the null variants (that is, all the variants on the even chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence = ncase/(ncase + n control )). FPr is evaluated at five different alpha (P value threshold) levels (α=0.05, 0.005, 5×10 −4 , 5×10 −5 , and 5×10 −6 ), as shown in panels from a) to e), respectively. The dashed lines indicate the expected FPrs (that is, the alpha levels). each boxplot represents the distribution of FPr across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQr), whiskers indicate data up to 1.5 times the IQr, and outliers are shown as separate dots. In all the analyses, we used a one-sided χ 2 d.f.=1 statistic to test against the null hypothesis of no association. Fig. 9 | statistical power for AcAt-V, fastGWA-BB, and ReGeNie-Burden under different simulation scenarios. Three gene-based test methods are compared in this analysis, that is, ACAT-V (implemented in GCTA), fastGWA-BB, and reGeNIe-Burden. The y-axis represents the power, defined as the proportion of the 100 simulated causal genes on chromosome 1 with P values less than the significance threshold after Bonferroni correction (that is, 0.05/1224=4.1×10 −5 , where 1,224 is the number of genes used in the simulation), and "Prev" on the x-axis represents different levels of simulated prevalence of the binary trait. The prevalence is defined as ncase/(ncase + n control )). We varied the proportion of variants being causal in a gene (5%, 20%, or 50%) and the directions of variant effects (random or consistent), as labelled in the title of each panel. each boxplot represents the distribution of power across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQr), whiskers indicate data up to 1.5 times the IQr, and outliers are shown as separate dot. In all the analyses, we used a one-sided χ 2 d.f.=1 statistic to test against the null hypothesis of no association. Fig. 10 | Area under the curve (Auc) for AcAt-V, fastGWA-BB, and ReGeNie-Burden under different simulation scenarios. Three gene-based test methods are compared in this analysis, that is, ACAT-V (implemented in GCTA), fastGWA-BB, and reGeNIe-Burden. The y-axis represents the AUC (that is, the area under the receiver operating characteristic (rOC) curve), and "Prev" on the x-axis represents different levels of simulated prevalence of the binary trait. The prevalence is defined as ncase/(ncase + n control )). We varied the proportion of variants being causal in a gene (5%, 20% or 50%) and the directions of variant effects (random vs. consistent), as labelled in the title of each panel. each boxplot represents the distribution of AUC across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQr), whiskers indicate data up to 1.5 times the IQr, and outliers are shown as separate dot. In all the analyses, we used a one-sided χ 2 d.f.=1 statistic to test against the null hypothesis of no association.