Genomic logit regression
Complex disease traits, as binary ones, usually follow binomial or Poisson distributions, so the generalized linear model (GLM) 1,2 is used to map QTLs controlling the traits. Assume that n individuals are recorded for phenotypic values and genotyped for m genetic markers. Distinguishing these markers from major and common alleles in a magnitude of effects of the markers on quantitative traits, we describe the relationship between all markers (predictors) and the mean of the exponential distribution family in the following logit regression:
$$\ln \left( {\tfrac{{\mathbf{\mu }}}{{{\mathbf{1}} - {\mathbf{\mu }}}}} \right)={\mathbf{Xb}}+{{\mathbf{Z}}_{\mathbf{1}}}{{\mathbf{a}}_{\mathbf{1}}}+{{\mathbf{Z}}_{\mathbf{2}}}{{\mathbf{a}}_{\mathbf{2}}}$$
1
where µ denotes the expectations of phenotypic distribution, b is the systematic environment effect; the population structure (stratification) which results in phenotypic differences among subpopulations is always considered here, except for sex, age, and some initial experimental conditions. a1 is the large genetic effect of q markers on phenotype, a2 is the minor or zero effect of the m-q markers on phenotype, and X, Z1, and Z2 are the corresponding design matrices of b, a1, and a2, respectively.
GRAMMAR for binary disease traits
We define the GBVs as
$${\mathbf{g}}={{\mathbf{Z}}_{\mathbf{1}}}{{\mathbf{a}}_{\mathbf{1}}}+{{\mathbf{Z}}_{\mathbf{2}}}{{\mathbf{a}}_{\mathbf{2}}}$$
2
Then, model (1) becomes
$$\ln \left( {\tfrac{{\mathbf{\mu }}}{{{\mathbf{1}} - {\mathbf{\mu }}}}} \right)={\mathbf{Xb}}+{\mathbf{g}}$$
3
which is a the GLMM 3. Under the assumption that \(({{\mathbf{a}}_{\mathbf{1}}}\, {{\mathbf{a}}_{\mathbf{2}}})\sim {N_m}({\mathbf{0}},\;{\mathbf{I}}\sigma _{a}^{2})\)with minor \(\sigma _{a}^{2}\) for each marker, the GBVs are turned into random effects and \({\mathbf{g}}\sim {N_n}({\mathbf{0}},\;{\mathbf{K}}\sigma _{g}^{2})\)with genomic variance of traits \(\sigma _{g}^{2}=m\sigma _{a}^{2}\) and the genomic relationship matrix (GRM) K 4. Based on the model (3), the GBVs can be estimated with the following GBLUP equations:
$$\left[ {\begin{array}{*{20}{c}} {{\mathbf{X}}_{{}}^{{\text{T}}}{\mathbf{WX}}}&{{\mathbf{X}}_{{}}^{{\text{T}}}{\mathbf{W}}} \\ {{\mathbf{WX}}}&{{{\mathbf{Z}}^{\mathbf{T}}}{\mathbf{WZ+}}\tfrac{{1 - {h^2}}}{{{h^2}}}{{\mathbf{K}}^{{\mathbf{-1}}}}} \end{array}} \right]\left[ {\begin{array}{*{20}{c}} {{\mathbf{\hat {b}}}} \\ {{\mathbf{\hat {g}}}} \end{array}} \right]=\left[ {\begin{array}{*{20}{c}} {{\mathbf{X}}_{{}}^{{\text{T}}}{\mathbf{W}}{{\mathbf{y}}^{\mathbf{*}}}} \\ {{\mathbf{W}}{{\mathbf{y}}^{\mathbf{*}}}} \end{array}} \right]$$
4
with
\({\mathbf{W}}={\mathbf{\mu }}({\mathbf{1}} - {\mathbf{\mu }})\) and \({{\mathbf{y}}^{\mathbf{*}}}{\text{=}}\ln \left( {\tfrac{{\mathbf{\mu }}}{{{\mathbf{1}} - {\mathbf{\mu }}}}} \right){\text{+}}\tfrac{{{\mathbf{y}} - {\mathbf{\mu }}}}{{{\mathbf{\mu }}({\mathbf{1}} - {\mathbf{\mu }})}}\).
where y is a binary phenotype, and \({h^2}=\tfrac{{\sigma _{g}^{2}}}{{\sigma _{g}^{2}+1}}\) is the unknown genomic heritability of liability with the residual variance of 1 to be assumed in GLMM, which can be estimated in advance using the REML for GLMM 3,5.
Unlike the normally distributed quantitative traits, the residuals for binary traits cannot be directly obtained due to the difference in scale between the predictors and response variables. Within the framework of GRAMMAR, thus, we eliminated polygenic effects on the binary phenotype by regarding the estimated GBVs \({\mathbf{\hat {g}}}\) as a known predictor in the following GLM:
$$\ln \left( {\tfrac{{\mathbf{\mu }}}{{{\mathbf{1}} - {\mathbf{\mu }}}}} \right)={{\mathbf{z}}_{{\text{SNP}}}}{a_{{\text{SNP}}}}+{\mathbf{\hat {g}}}$$
5
with a regression item \({{\mathbf{z}}_{{\text{SNP}}}}{a_{{\text{SNP}}}}\) of the SNP tested.
With the iteratively re-weighted least square method 1, we obtained the maximum likelihood estimate for the SNP effect as:
$${\hat {a}_{{\text{SNP}}}}={({\mathbf{z}}_{{{\text{SNP}}}}^{{\text{T}}}{\mathbf{W}}{{\mathbf{z}}_{{\text{SNP}}}})^{ - 1}}{\mathbf{z}}_{{{\text{SNP}}}}^{{\text{T}}}{\mathbf{W}}({{\mathbf{y}}^{\mathbf{*}}} - {\mathbf{\hat {g}}})$$
6
The test statistic to infer the association of the SNP with binary traits is generally formulated by
$${\chi ^2}=\frac{{\hat {a}_{{{\text{SNP}}}}^{2}}}{{{\mathbf{z}}_{{{\text{SNP}}}}^{{\text{T}}}{\mathbf{W}}{{\mathbf{z}}_{{\text{SNP}}}}}}$$
7
which is subject to the chi-squared distribution with the 1 degree of freedom.
Optimal Genomic control
In GRAMMAR for binary traits, replacement of polygenic effects excluding QTNs with GBVs deflates the test association statistics, which yields a high false negative rate. By regulating the downward genomic heritability, we can more accurately estimate the polygenic effects with the GBLUP Eq. (4). The polygenic heritability less than genomic heritability is determined by optimizing genomic control for association tests. Such an optimization for GRAMMAR can be summarized in the following steps:
-
Set the searching open interval of h2 to (0, 1)
-
Estimate the GBVs \({\mathbf{\hat {g}}}\) using Eq. (4);
-
Statistically infer the genetic effect for each SNP by the chi-squared statistic (7);
-
Calculate the genome-wide chi-squared mean or statistical probability for each SNP;
-
Plot the quantile-quantile (Q-Q) profile for genome-wide statistical probabilities;
-
Update h2 with Brent’s method 6;
-
Repeat step (2)-(6) until the genome-wide chi-squared mean reaches 1.0 plus or yields a satisfactory Q-Q plot.
Joint association analysis
After optimizing genomic control for GRAMMAR, we jointly analyzed multiple QTN candidates to improve the statistical power to detect QTNs for binary disease traits. Multiple QTN candidates were chosen within the interval of significance level 0.05 to the Bonferroni corrected criterion 7, so that the number of QTN candidates was limited to be no greater than the population size. Backward regression was adopted to optimize the multiple GLM with known optimized polygenic effects in a stepwise manner:
$$\ln \left( {\tfrac{{\mathbf{\mu }}}{{{\mathbf{1}} - {\mathbf{\mu }}}}} \right)={\mathbf{Xb}}{\text{+}}{{\mathbf{Z}}_{\mathbf{1}}}{{\mathbf{a}}_{\mathbf{1}}}+{\mathbf{\hat {g}}}$$
8
Given the Bonferroni corrected significance level, the significant QTN effects remained in the model (8) according to the corrected statistic (7).
Simulations
Two genomic datasets of human 8and maize 3 samples were used to simulate the adaptability of GRAMMAR for binary traits to population structure. The maize population has a more complex structure than the human population. Then, 300,000 SNPs for both 12000 people and 2640 maize were extracted through higher quality control. In whole simulations, control and case samples were constrained to 1:1 for the maize population, and 3000 cases were selected from the human population with low incidence rate of 5% simulated in advance. QTNs were distributed randomly over the entire SNPs, whose additive effects were sampled from a gamma distribution with shape = 1.66 and scale = 0.4. Given the genomic heritability of liability, phenotypes of control (0) and case (1) can be generated from the genomic logit model (1).
In addition to population structure, the number of QTNs, genome heritability, and sampling number of SNPs were considered as experimental factors in the simulations. Under the optimized genomic control infinitely close to 1.0, the ROC profiles can be plotted by statistical powers to detect the QTNs relative to a given series of Type I errors. Statistical powers are defined as the percentage of identified QTNs that have the maximum test statistic among their 20 closest neighbors over the total number of simulated QTNs. Simulations were repeated 50 times, and in each simulation, the positions and effects of QTNs simulated were varied and the average results were recorded.
Method References
1. Wedderburn, R.W.M. Quasi-likelihood functions, generalized linear models, and the gauss-newton method. Biometrika 61, 439–447 (1974).
2. McCullagh, P. & Nelder, J.A. Generalized linear models, 2nd ed., (Chapman and Hall, New York, 1989).
3. Breslow, N.E. & Clayton, D.G. Approximate inference in generalized linear mixed models. Journal of the American statistical Association 88, 9–25 (1993).
4. Vanraden, P.M. Efficient methods to compute genomic predictions. Journal of Dairy Science 91, 4414–4423 (2008).
5. Chen, H. et al. Control for Population Structure and Relatedness for Binary Traits in Genetic Association Studies via Logistic Mixed Models. Am J Hum Genet 98, 653 − 66 (2016).
6. Brent, R.P. Algorithms for minimization without derivatives, (Prentice-Hall, New Jersey, 1973).
7. Hochberg, Y. & Tamhane, A.C. Multiple Comparison Procedures, (John Wiley & Sons, Inc., New York, 1987).
8. Romay, M.C. et al. Comprehensive genotyping of the USA national maize inbred seed bank. Genome biology 14, R55 (2013).