Theoretical and empirical EGRMs
To calculate the theoretical results (5)-(10), we computed the MAFs of the SNPs with PLINK. The variance, the intra-population covariance, and the inter-population covariance were calculated as in (7), (8), and (10), respectively, with the term they depend on computed by (3). The theoretical variance and intra-population covariance for the five populations, with SNPs from the six frequency bins, are presented in Table 2. The absolute values of the inter-population covariance are much smaller; these results are shown in Supplemental Tables S1-6.
To obtain the empirical values of the variance and of the intra- and inter-population covariances, we first computed GRMs with SNPs from the six bins using EIGENSOFT. Each GRM contains the variance and covariance terms of the N individuals, based on the observed genotype data. The empirical variance for population k was computed as the average variance of the individuals from population k. The empirical intra-population covariance is the average covariance over pairs of individuals from population k. Lastly, the empirical inter-population covariance is the average covariance over pairs of individuals, one from population k and one from population l. The variance and intra-population covariance results are shown in Table 2, and the inter-population covariance results are presented in Supplemental Tables S1-6.
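The averaging described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the GRM is taken to be already loaded as a square array (no EIGENSOFT file format is assumed), and the function name `block_averages`, the toy matrix, and the population labels are hypothetical.

```python
import numpy as np

def block_averages(grm, labels):
    """Average variance and covariance terms of a GRM by population.

    grm    : (N, N) symmetric array of variance/covariance terms
    labels : length-N sequence with the population of each individual
    """
    grm = np.asarray(grm, dtype=float)
    labels = np.asarray(labels)
    pops = list(dict.fromkeys(labels.tolist()))  # unique labels, first-seen order
    variance, intra, inter = {}, {}, {}
    for i, k in enumerate(pops):
        idx_k = np.flatnonzero(labels == k)
        block = grm[np.ix_(idx_k, idx_k)]
        n_k = len(idx_k)
        # Average of the diagonal terms of individuals from population k.
        variance[k] = np.trace(block) / n_k
        # Average of the off-diagonal terms (pairs of distinct individuals).
        if n_k > 1:
            intra[k] = (block.sum() - np.trace(block)) / (n_k * (n_k - 1))
        else:
            intra[k] = float("nan")
        # Average over pairs with one individual from k and one from l.
        for l in pops[i + 1:]:
            idx_l = np.flatnonzero(labels == l)
            inter[(k, l)] = grm[np.ix_(idx_k, idx_l)].mean()
    return variance, intra, inter

# Toy example with made-up numbers (not values from the paper).
toy_grm = np.array([
    [1.0, 0.2, 0.0, 0.1],
    [0.2, 1.2, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.3],
    [0.1, 0.0, 0.3, 1.1],
])
toy_labels = ["A", "A", "B", "B"]
variance, intra, inter = block_averages(toy_grm, toy_labels)
```

Each inter-population pair is stored once, keyed by `(k, l)` in order of first appearance, because the GRM is symmetric.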
Across the six frequency bins, the theoretical values of the variance, the intra-population covariance, and the inter-population covariance predicted by (7), (8), and (10), respectively, are close to their empirical values. As the MAFs of the SNPs become smaller, the intra-population covariance decreases. For example, for EAS it was 0.2 with SNPs whose MAFs are between 0.4 and 0.5, and fell to 0.003 in the sixth bin, which included rare SNPs only. A similar pattern can be observed for the other four populations. FPC was estimated by (13) for the six bins, using empirical values. The FPC decreases from 93.85 in bin 1 to 55.01 in bin 5, and further to 1.83 in bin 6. Thus the divergence among the populations is much larger when measured by common SNPs than by rare ones.
PCAs of the 1000 Genomes Project data
With the genotypes of SNPs from each frequency bin, we carried out PCAs of population stratification with EIGENSOFT, which is essentially based on the eigen-analysis of the observed GRMs. Scatter plots of the three largest PCs are shown in Figures 1 and 2, where eigenvectors were scaled by the square roots of their corresponding eigenvalues.
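The eigen-analysis and scaling step can be illustrated with a short NumPy sketch. It is not the EIGENSOFT implementation: the function `grm_pca`, the simulated genotypes, and the SNP standardization by 2p(1-p) are assumptions made only to produce a small, self-contained example. The proportion of variance is taken here as each eigenvalue divided by the trace of the GRM, which may differ from how a given program reports it.

```python
import numpy as np

def grm_pca(grm, n_pcs=3):
    """Eigen-analysis of a GRM, returning scaled PC coordinates."""
    grm = np.asarray(grm, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(grm)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    top_vals = eigvals[:n_pcs]
    # Scale each eigenvector by the square root of its eigenvalue.
    coords = eigvecs[:, :n_pcs] * np.sqrt(np.clip(top_vals, 0.0, None))
    # Share of total variance (trace of the GRM) carried by each PC.
    explained = top_vals / np.trace(grm)
    return coords, top_vals, explained

# Simulated genotypes (hypothetical data, 20 individuals x 200 SNPs).
rng = np.random.default_rng(0)
freqs = rng.uniform(0.1, 0.5, size=200)
geno = rng.binomial(2, freqs, size=(20, 200)).astype(float)

# Standardize each SNP; drop any SNP that is monomorphic in the sample.
p = geno.mean(axis=0) / 2
keep = (p > 0) & (p < 1)
z = (geno[:, keep] - 2 * p[keep]) / np.sqrt(2 * p[keep] * (1 - p[keep]))
grm = z @ z.T / keep.sum()

coords, top_vals, explained = grm_pca(grm, n_pcs=3)
```

With this scaling, the squared length of each coordinate column equals its eigenvalue, so PCs that explain little variance place all individuals close to the origin.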
Figures 1 and 2 show the patterns of population structure computed with common and less frequent SNPs. Figures 1a-e and 2a-e display similar patterns, whereas the scatter plots based on rare SNPs differ markedly. For example, AMR and SAS are separated mostly by the third PC with common SNPs, while they are distinguished by the second PC with rare ones. The third PC from rare SNPs reveals mostly substructure of AFR, likely because more rare SNPs are polymorphic in AFR than in other populations. The proportion of variance explained by the five largest PCs decreases from 17.09% in bin 1 to 10.41% in bin 5, and falls dramatically to 0.74% with rare SNPs only. As a result, the five populations are more closely distributed around the origin in Figures 1f and 2f than in Figures 1a-e and 2a-e. Clearly, common variants perform much better than rare variants in dissecting the population structure.
PCAs of EGRMs
For each frequency bin, we also constructed an EGRM with the structure described in (5), (6), and (9), whose variance and covariance elements were set to their theoretical values calculated by (7), (8), and (10), respectively. We conducted PCAs of the EGRMs using GCTA, and scatter plots of the three largest PCs are shown in Figures 1 and 2. As before, coordinates were scaled by the square roots of their eigenvalues.
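A rough sketch of this construction follows. It assumes a block structure in which all individuals of a population share one variance (diagonal), one intra-population covariance (within-block off-diagonal), and one inter-population covariance per pair of populations (between blocks). The function `build_egrm`, the population sizes, and all numeric values are hypothetical, and NumPy stands in for GCTA.

```python
import numpy as np

def build_egrm(sizes, variance, intra, inter):
    """Assemble a block-structured matrix from per-population terms.

    sizes    : number of individuals in each of K populations
    variance : K diagonal values, one per population
    intra    : K within-population covariances
    inter    : (K, K) symmetric array of between-population covariances
               (its diagonal is ignored)
    """
    sizes = np.asarray(sizes)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    cov = np.array(inter, dtype=float)
    np.fill_diagonal(cov, intra)            # K x K table of covariances
    egrm = cov[np.ix_(labels, labels)]      # expand to N x N blocks
    np.fill_diagonal(egrm, np.asarray(variance, dtype=float)[labels])
    return egrm, labels

# Hypothetical values for three populations (not taken from Table 2).
sizes = [20, 25, 30]
variance = [1.00, 0.98, 1.02]
intra = [0.20, 0.15, 0.10]
inter = np.full((3, 3), -0.05)

egrm, labels = build_egrm(sizes, variance, intra, inter)

# Eigen-analysis, keeping the two largest PCs scaled by sqrt(eigenvalue).
eigvals, eigvecs = np.linalg.eigh(egrm)
order = np.argsort(eigvals)[::-1][:2]
coords = eigvecs[:, order] * np.sqrt(eigvals[order])
```

In this toy setting every individual of a population receives the same coordinates on the leading PCs, so each population collapses to a single point in the scatter plot.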
Comparing the representative points in Figures 1 and 2, we can see that the distances between populations decrease as the SNPs change from common to rare. The sum of squared distances was calculated for the six frequency bins by (15), using the eigenvalues of the EGRM Z and the theoretical values of the remaining terms. It decreases from 444.38 in bin 1 to 254.10 in bin 5, and further to 17.83 in bin 6.
In addition, when the proportions of variance explained by the PCs become small, deviations between the representative points of the populations and the true centers of the populations can be observed. This is particularly evident in the scatter plots with rare SNPs. In the PCAs of a single population, such deviations are more obvious when the proportions of variance explained by the largest PCs are much smaller.