Genetic relationship matrix
In the scenario where genotype data of individuals is sampled from K populations, there areindividuals in population k and the number of individuals in the total population is. We have M SNPs, whose frequencies of their reference alleles in population k are . Let X be the genotype matrix of dimension. The element X(n,m) is coded as the number of the reference allele of SNP m for individual n, where and . Typically, the number of individuals is less than the number of markers, i.e. We assume that all SNPs are under the Hardy-Weinberg equilibrium in each population. The GRM can be calculated as, where each entry of Y is a normalized version of the coded genotype in X, i.e.
for and . Here, and denote the genotypic mean and standard deviation of SNP m in the total population, respectively. It can be shown that and relate to the population structure and allele frequencies as follows (Supplemental Text S1).
The reference-allele frequency of SNP m in the total population can be found as , where . The GRM is of dimension , whose diagonal elements are genotypic variance of individuals and off-diagonal elements are genotypic covariance between two individuals. It should be noted that genotypes follow mixed binomial distributions, and elements of are sample variances and covariance computed from the genotype data. The PCA analysis of population stratification is based on the eigen-analysis of the observed GRM .
In practice, and are unknown, and therefore sample mean and sample standard deviation or some other quantities similar are used instead. Usually, is used for the centralization in (2). In EIGENSOFT, is adopted for the normalization in (2), while is employed in GCTA [22]. Different estimates of the allele frequencywere suggested as well [5, 6]. In the following theoretical derivations, we assume that and are known for the sake of simplicity. This will bring about clear connections between variance and covariance elements of the EGRM and allele frequencies of the SNPs used in the analysis. The connections further provide clues and insights for understanding the effect of rare variants on the PCA of population stratification.
Expected genetic relationship matrix
We assume that all individuals are unrelated and all SNPs are in linkage equilibrium. When the number of markers M goes large, the sample variance and covariance elements in the GRM will converge to their mathematical expectations in probability due to the law of large numbers. We denote the EGRM as Z, which is the expectation of the GRM . Without loss of generality, we assume that individuals are ordered according to their population memberships. As such, the first rows and columns correspond to individuals from population 1, the next rows and columns are from population 2, and so on. Thus, the EGRM can be partitioned into a block matrix consisting of submatrices.
Diagonal sub-matrices of the EGRM Z have the following structure.
Here, diagonal elements of the submatrix are of the mathematical form which is the genotypic variance of individuals from population k. All off-diagonal elements share the form which is the genotypic covariance between two individuals from population k.
The off-diagonal sub-matrices of the EGRM Z are structured as follows
Elements of share the value which is the genotypic covariance between one individual from population k and one from population l. Details of the derivations are presented in Supplemental Text S2.
The EGRM Z, the mathematical expectation of GRM Z, depends only on the population sizesand allele frequencies of the M SNPs in K populations ,. Here, we treat the SNP number M and the allele frequencies as fixed numbers. A theoretical formulation of the PCA that considers allele frequencies in different populations as a random vector was proposed in Ma and Amos, 2010 [23]. Although based on different assumptions, a genotypic variance-covariance matrix of the same structure was attained, whereas elements of the EGRM are expressed as explicit functions of allele frequencies. This allows for the investigation of the effect of rare variants on the analysis of population stratification.
Rare variants on the eigenvalues
Carrying out eigen-decomposition on the EGRM, it can be shown that there are eigenvalues of value , for . The sum of the eigenvalues is (11)
Clearly, variations in the PCs are entirely intra-population variance of the K populations. The sum of the other K eigenvalues is, (12) where represents the inter-population variance component and stands for the intra-population variance component. Here, the intra-population covariance of the EGRM factor in the K PCs as the inter-population variance after the eigen-decomposition. Note that all inter-population variance is contained in the K PCs. Hence, it is sufficient to conduct the population stratification analysis based on the K PCs alone.
Given a set of SNPs, the divergence among the K populations can be measured by the ratio of the two variance components, i.e. (13)
Its normalized version can be defined as (14)
The larger the FPC is, the more divergence among the populations.
Note that , terms in are quadratic functions of , , . However, terms in are linear and quadratic functions of the frequencies. Therefore, FPC will decrease when the allele frequencies of the SNPs become smaller. As a result, instead of improving the population stratification analysis, using rare variants will deteriorate the analysis performance. Meanwhile, since decreases faster than , the K eigenvalues will be closer to the other eigenvalues when allele frequencies of the SNPs become smaller. Thus, the percent of variance explained by the K PCs will be smaller.
It can be shown that among the K eigenvalues, are of large values and one small. When intra-population variance of the K populations are equal, all inter-population variance is contained in the largest eigenvalues. In addition, when the sample size is large and the portions of populations remain, any inter-population variance contained in the small eigenvalue is negligible, almost all information on the population structure is contained in the largest PCs.
For cases with two populations, it can be shown that the two eigenvalues are.
When inter-population variance of the two populations are equal, i.e. , we have.
That is, all information on the population structure is contained in the largest PC. All proofs are presented in Supplemental Text S3.
Rare variants on the population distance
Suppose that , are the eigenvectors associated with the K eigenvalues containing inter-population variance. We can represent each individual as a point in the K-dimension space. Vector consists of coordinates of N individuals in the k-th dimension. Average value represents center of the total population in the k-th dimension, where is the vector of dimension and with each element as 1. Due to the structure of Z, individuals from the same population share the same coordinates in the K-dimension space, and the common points denote the representative points of the populations, or centers of the populations.23 We define (15) which measures the population divergence as the sum of squared distances among populations. The proof is shown in Supplemental Text S4. Note that the second term in (15) is an average intra-population variance. As explained earlier that when allele frequencies become smaller, the K eigenvalues decrease. Hence, the population distance is smaller when using rare SNPs compared to common ones.
The 1000 Genomes Project data
We used genotype data from the 1000 Genomes Project to validate our theoretical results. Genotype data used in this work was obtained from Phase 3 version 5a of the 1000 Genomes Project [16, 17], which contains 84.4 million genetic markers and 2504 individuals with ancestry from Europe (EUR), East Asia (EAS), South Asia (SAS), West Africa (AFR) and Americas (AMR). We extracted biallelic SNPs that are polymorphic in the total population. In summary, there are 77,279,863 SNPs; 5,261,820 are common (0.1<MAF≤0.5), 6,770,457 are low-frequency (0.01<MAF≤0.1), and 65,247,586 are rare (0.0001<MAF≤0.01). Genotype data were converted to PLINK format with VCFtools [24]. The SNPs were divided into six frequency bins according to their MAFs, as shown in Table 1. For each bin, we randomly sampled approximately 100,000 SNPs using PLINK for the population stratification analyses. PCAs were carried out, with GRMs computed by EIGENSOFT and PCAs on EGRMs conducted using GCTA. Default parameters were used when analyzing with EIGENSOFT, which excluded 68 and 116 outliers in the analyses of the data from frequency bin 5 and 6, respectively.