FastGWA-GLMM: a generalized linear mixed model association tool for biobank-scale data

12 Compared to linear mixed model-based genome-wide association (GWA) methods, generalized 13 linear mixed model (GLMM)-based methods have better statistical properties when applied to 14 binary traits but are computationally much slower. Here, leveraging efficient sparse matrix-based 15 algorithms, we developed a GLMM-based GWA tool (called fastGWA-GLMM) that is orders of 16 magnitude faster than the state-of-the-art tool (e.g., ~37 times faster when 𝑛 = 400,000 ) with 17 more scalable memory usage. We show by simulation that the fastGWA-GLMM test-statistics of 18 both common and rare variants are well-calibrated under the null, even for traits with an extreme 19 case-control ratio (e.g., 0.1%). We applied fastGWA-GLMM to the UK Biobank data of 456,348 20 individuals, 11,842,647 variants and 2,989 binary traits (full summary statistics available at 21 http://fastgwa.info/ukbimpbin) and identified 259 rare variants associated with 75 traits, 22 demonstrating the use of imputed genotype data in a large cohort to discover rare variants for 23 binary complex traits.


26
Over the past decade, we have witnessed the tremendous growth of data from genome-wide 27 association studies (GWASs). For example, there are nearly half million genotyped individuals 28 with rich phenotypes in the UK Biobank (UKB) 1 , which have played a pivotal role in discovering 29 novel genotype-phenotype associations in recent years 2-6 . Nonetheless, the scale of biobank data practice is to remove rare variants (e.g., minor allele frequency, MAF < 0.01) and phenotypes with 41 a low case-control ratio (e.g., < 1:99) 9,10 , resulting in unnecessary loss of data. currently the most commonly used GLMM-based tool for biobank-scale data because of its 47 computational efficiency and well-calibrated test-statistics of both common and rare variants for 48 unbalanced binary traits. However, it is almost computationally prohibitive to use SAIGE to 49 analyse all the thousands of binary traits in the UKB, more so in cohorts with larger sample sizes 50 than the UKB (e.g., data accumulated in the direct-to-consumer genetic testing companies). The 51 main reason why the performance of SAIGE is encumbered is because of the manipulation of full-52 dense n  n matrices (although not explicitly computed) with n being the sample size, which is 53 both time-and resource-consuming.

55
In our previous work, we developed an LMM-based GWA tool, fastGWA, that is orders of 56 magnitude faster than BOLT-LMM, mainly owing to the use of a sparse genomic relationship 57 matrix (GRM) to capture pedigree relatedness among individuals 10 . However, when we applied 58 fastGWA in the GWA analyses of all the UKB traits, we had to remove around 3 million rare 59 variants (MAF  0.01) and ~1,000 traits with a case-control ratio < 1:99 to avoid the inflation in 60 FPR mentioned above 10 . In this study, we aim to develop a GWA tool that is scalable to GWAS data 61 of over a million individuals and applicable to both common and rare variants for all binary 62 phenotypes including those with a low case-control ratio. To achieve this goal, we incorporated 63 GLMM into the fastGWA framework and developed efficient sparse matrix-based algorithms for 64 parameter estimation and association test. We name the method fastGWA-GLMM and 65 demonstrate by simulation that the test-statistics from fastGWA-GLMM are not inflated for either 66 common or rare variants even if the case-control ratio is extremely low (e.g., 0.1%). We then show 67 by analysing subsets of the UKB data that fastGWA-GLMM is orders magnitude faster than SAIGE 68 with more scalable memory usage (e.g., when n = 400,000, fastGWA-GLMM is ~34 times faster 69 and uses only about a third memory compared with SAIGE). From the speed test results, we 70 predict that fastGWA-GLMM is, in principle, applicable to GWAS data with sample sizes over a 71 million. We have implemented fastGWA-GLMM in the GCTA software package 15  where is an n×n projection matrix, which is dense despite being sparse. Therefore, to avoid 102 computational bottleneck due to matrix multiplication involving , a GLMM version of the 103 GRAMMAR-GAMMA approximation 14 is implemented in fastGWA-GLMM (Online Methods).

105
As for the inflation in test-statistics due to case-control imbalance, for any variant with a score 106 test p-value larger than a threshold (e.g., =1 2 = 2), SPA is applied to calibrate the test statistic.

107
In addition, to further improve computational efficiency, in fastGWA-GLMM, we developed an 108 approximate approach to account for covariates for variants with score test is >100 times slower than fastGWA-GLMM and not applicable to some of our simulation settings.

120
After randomly sampling subgroups of individuals (n ranged from 50,000 to 400,000) from the 121 UKB, we performed a GWA in each subset of data using fastGWA-GLMM and SAIGE respectively, 122 on a computing platform with 80 GB memory and 8 CPU cores. The trait used for comparison is 123 "Irritability" (case-control ratio = 0.39; UKB data-field: 1940). The genotype data were stored in 124 BGEN v1.3 format 16 . Each test was repeated 5 times for an average of runtime and memory usage.

125
As shown in Figure 1a, for a GWA with n = 400,000, fastGWA-GLMM only required 4.9 hours, 126 which is ~37 times more efficient than SAIGE. Besides, the runtime of the estimation step of

152
The results showed that when the prevalence was larger than 0.05, the FPRs of the null variants

164
We partitioned all the null variants into two groups (common and rare variants) based on an MAF 165 threshold of 0.01 and evaluated the FPR of the two groups separately in each simulation scenario.

166
The FPRs for rare variants (MAF < 0.01) from LR-unRel and LR-All were substantially inflated in 167 the scenarios with low prevalences, while those from fastGWA-GLMM remained consistent with 168 the expected values for both common and rare variants regardless of the prevalence level 169 ( Supplementary Figures 3 and 4). SAIGE showed similar performance as fastGWA-GLMM, 170 except that it showed more deflated FPRs than all the other methods when prevalence = 0.005 171 due to its convergence issue as described above.

173
Next, we quantified the statistical power of different methods by calculating the mean 2 statistic 174 at the causal variants. We found that the power of fastGWA-GLMM was slightly higher than that 175 of SAIGE (Figure 3). The mean 2 statistic of LR-All and LR-unRel was not informative in this case 176 as it suffered from inflation driven by both relatedness and case-control imbalance. We then 177 quantified the power of common and rare causal variants separately. The patterns were similar 178 between common and rare variants, though the power to detect the rare causal variants was 179 lower than that for the common causal variants (Figure 3). We also used the area under the curve 180 (AUC) as a metric to compare the difference in power between the methods given the same level   clumping, the number of quasi-independent signals from fastGWA-GLMM was also higher than 208 that from SAIGE or LR-unRel (Supplementary Table 6). As for case-control imbalance, the 209 results from LR-unRel started to exhibit inflation for traits with prevalence < 0.01 (see the 3 rd

236
The major advantage of fastGWA-GLMM over LR-unRel is that it does not need to remove related 237 individuals from the study, as the relatedness can be well accounted for by a pedigree relatedness 238 matrix or a sparse GRM. Take the real data application in the UKB as an example. FastGWA-GLMM 239 was able to include all 456,348 participants into the association test, while LR-unRel could only 240 utilize information from 348,456 unrelated participants. Since most of the large population-based 241 cohorts rely on an assessment-centre based recruitment strategy, the proportion of relatives in 242 the cohorts tends to be high and will keep increasing in the future 1 . In such case, it is crucial to 243 avoid removing data of related individuals. Another advantage of fastGWA-GLMM over LR-unRel 244 is its efficiency. FastGWA-GLMM, as many other GLMM-based methods, uses a score statistic for 245 association test, which is computationally easy to compute (Online Methods). In contrast, LR-246 unRel as in PLINK2 is based on an iteratively reweighted least squares method and the Wald's 247 test that solves the full model for each variant repeatedly, which is much slower than the score 248 test especially when covariates are included.    > 2); note 278 that the SPA correction is also applied when this threshold is met (Methods). This strategy allows 279 fastGWA-GLMM to omit the computation of matrix multiplication between the covariate matrix 280 and ~95% of the genotype vectors. We confirmed that the difference of test statistics between 281 the approximate covariate-adjustment approach and the exact approach is negligible, and only 282 variants with =1 2 < 2 might suffer from slight deflation in test-statistics which does affect the 283 power of detecting association at a genome-wide significance level (Supplementary Figure 13).

284
This strategy is particularly useful when the number of covariates is large (e.g., larger than 20).

285
In our software tool, there is an option to allow users to switch off this approximation and force 286 all the statistics to be calculated by the covariate-adjusted genotypes, which will cause a loss of 287 computational efficiency by a few folds, depending on the number of covariates.

289
There are a few caveats when applying fastGWA-GLMM in practice. First, if pedigree data are not 290 usable, a sparse GRM needs to be pre-computed from the SNP data. A very efficient parallelized 291 algorithm has been implemented in GCTA to compute the sparse GRM 10 . Since the sparse GRM 292 setting has already been adopted by fastGWA 10 , once generated, the same sparse GRM of a cohort 293 can be used for GWA analyses of all the quantitative and binary phenotypes. Therefore, the 294 average computational cost per trait is minimal. Second, the ̂2 estimated from fastGWA-GLMM-

295
REML cannot be interpreted as genetic variance or heritability. This is mainly due to the use of 296 the penalized quasi-likelihood and the Laplace method 14 . However, from our simulations and real 297 data applications, it did not affect the statistical performance of the association test of fastGWA-298 GLMM. Third, in our previous work, we found that when analysing quantitative traits the ̂2 299 estimated based on a sparse GRM might be a better quantity to control for relatedness than that 300 from dense-GRM-based methods 10 . In this study, however, when analysing binary traits, we 301 observed that fastGWA-GLMM did not have such advantage over SAIGE. Both fastGWA-GLMM and 302 SAIGE had well-controlled FPR. Fourth, the inclusion of rare variants in the association tests 303 increases the multiple testing burden. Hence, in this study, following the guideline from previous 304 studies 23,24 , we used p-value  510 -9 instead of 510 -8 as the genome-wide significance threshold.

306
Despite these caveats, fastGWA-GLMM is a highly efficient GLMM-based method that is applicable  34 , which has been adopted by both GMMAT 33 and SAIGE 14 .

337
Leveraging the sparsity of , we propose a grid-search-based REML approach (called fastGWA-338 GLMM-REML) with a special optimizer (see below) that can directly maximize ql( , 2 ) and 339 return a maximum likelihood estimate of 2 , which is often orders of magnitude faster and more

538
Each test was repeated 5 times for an average.