Data merging and filtering
KhoeSan genotype data from Martin and colleagues (15) were merged with the genetic data generated as part of the Population Architecture using Genomics and Epidemiology dataset (16) and genetic data from the Gujarati Indian and British populations from the 1000 Genomes Project (17). Preliminary data filtering included filters for minor allele frequency (0.003), missingness per genotype (max 0.05) and missingness per individual (max 0.01). A total of ~776k SNPs passed these filters and formed the initial merged dataset. Further data filtering is described in the appropriate sections below. Data were phased with SHAPEIT2 using the published African American HapMap recombination map (18,19). Populations in the final dataset are summarised in Table 1.
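As an illustration of the preliminary filters described above, the following minimal sketch applies the three thresholds to a small genotype matrix. It is not the pipeline used in this study: the function names, the in-memory matrix representation and the order in which the filters are applied (SNP-level filters first, then individual-level missingness) are assumptions made for the example.

```python
from typing import List, Optional, Tuple

# 0, 1 or 2 copies of the alternate allele; None marks a missing call.
Genotype = Optional[int]


def missing_rate(calls: List[Genotype]) -> float:
    """Fraction of missing calls in a list of genotypes."""
    return sum(g is None for g in calls) / len(calls)


def minor_allele_frequency(calls: List[Genotype]) -> float:
    """Minor allele frequency computed over the non-missing genotypes."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1.0 - alt_freq)


def filter_genotypes(
    matrix: List[List[Genotype]],
    maf_min: float = 0.003,
    snp_missing_max: float = 0.05,
    ind_missing_max: float = 0.01,
) -> Tuple[List[int], List[int]]:
    """Return (kept SNP indices, kept individual indices).

    matrix[i][j] is the genotype of individual i at SNP j.
    Assumed order: SNP filters first, then individual missingness
    evaluated on the retained SNPs only.
    """
    n_snps = len(matrix[0])
    keep_snps = []
    for j in range(n_snps):
        column = [row[j] for row in matrix]
        if (missing_rate(column) <= snp_missing_max
                and minor_allele_frequency(column) >= maf_min):
            keep_snps.append(j)
    if not keep_snps:
        return [], []
    keep_inds = [
        i for i, row in enumerate(matrix)
        if missing_rate([row[j] for j in keep_snps]) <= ind_missing_max
    ]
    return keep_snps, keep_inds
```

The default thresholds mirror the values stated in the text; in practice such filters are usually applied with dedicated genotype-processing software rather than in pure Python.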
Simulations
The computational workflow is summarised in Figure 1. A random subset of 55 reference individuals (11 per reference population) from the final merged dataset described in Table 1 was used to generate a simulated dataset using admix-simu (20). The remaining 444 reference individuals formed the reference dataset for GAI and LAI. A demographic model specifying ancestry proportions and the timing of migration, following a continuous migration model beginning 15 generations ago, was used to generate a simulated 5-way admixed population (3) (please see Table S1 for the specific admixing proportions). This simulation results in a heterogeneous population, reminiscent of a real-world SAC population (see Table 2).
The simulation does not take post-admixture selection into account, since it is highly unlikely that 350 years would result in distinct selection signals. Rather, the inherent selection signals in the source populations will be transferred in a random manner to the simulated admixed population (adaptive introgression). Genotype as well as local ancestry calls were generated for this simulated dataset from real reference haplotypes, thus capturing the complexity of this heterogeneous 5-way admixed South African population.
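The partition of reference individuals into simulation founders and a held-out reference panel can be sketched as follows. This is a hedged illustration only: the function name, the seed, and the dictionary mapping individual IDs to population labels are hypothetical, and the actual sampling procedure used in the study is not specified beyond "a random subset".

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def split_reference(
    population_of: Dict[str, str],
    per_population: int = 11,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Draw `per_population` individuals at random from each population.

    Returns (founders, reference): founders seed the simulation, and the
    remaining individuals form the reference panel for GAI and LAI.
    """
    by_pop: Dict[str, List[str]] = defaultdict(list)
    for individual, population in sorted(population_of.items()):
        by_pop[population].append(individual)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    founders: List[str] = []
    for population in sorted(by_pop):
        # random.sample raises ValueError if a population is too small
        founders.extend(rng.sample(by_pop[population], per_population))

    founder_set = set(founders)
    reference = [ind for ind in sorted(population_of) if ind not in founder_set]
    return founders, reference
```

With five reference populations and 499 individuals in total, this yields 55 founders and 444 reference individuals, matching the counts given above; keeping the two sets disjoint ensures that the haplotypes used to build the simulated individuals are not also present in the reference panel.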
Software choices
Although a number of software programs are able to estimate global ancestry, ADMIXTURE is the most widely used. Reasons for this include the ability to include related individuals in one run and to generate accurate admixture proportions using relatively low-density SNP-array data (11). The other widely used global ancestry algorithm, STRUCTURE, has been shown to overestimate admixture proportions in even simple admixture scenarios (21); given the demographic history of the population presented here, this software was therefore not used.
RFMix was chosen as the local ancestry inference algorithm because it allows for parameter optimization given the number of ancestral populations and can perform LAI in populations that are more than 2-way or 3-way admixed (limitations of LAMP (22) and HAPMIX (23)). In addition, RFMix has the inherent ability to calculate local and global ancestry simultaneously and allows for array-based input data as well as whole genome sequencing data. Furthermore, initial results by Daya and colleagues suggested that RFMix is the most accurate tool for local ancestry estimation (over and above that calculated for LAMP-LD (24,25)) in admixed southern African populations (13). However, only a 3-way admixture scenario was tested (San, Bantu-speaking and non-African).
GAI accuracy
Reference individuals not included in the dataset used for the simulation were allocated to the dataset used for GAI and LAI. Global ancestry proportions were determined by ADMIXTURE (11) and RFMix (9).
The ADMIXTURE analysis was performed in both a supervised and an unsupervised manner after filtering the dataset for linkage disequilibrium as per the manual’s recommendations (50kb window size, step size of 10kb and r² threshold of 0.1). The supervised algorithm allows for the input of known ancestral origins of the reference individuals, whereas the unsupervised algorithm infers the ancestry of all individuals given as input.
RFMix was run using default parameters, with a time since admixture of 15 generations (in line with the simulation) and 3 expectation-maximization (EM) iterations (further EM iterations were not shown to increase accuracy (9)). Agreement between the two methods was quantified by the Root Mean Squared Error (RMSE), which was calculated in R.
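For reference, the RMSE between two sets of global ancestry proportions can be sketched as below. The calculation in this study was performed in R; the Python function here is only an illustrative equivalent, and the way the proportions are paired (one value per individual per ancestry, in the same order for both methods) is an assumption.

```python
import math
from typing import Sequence


def rmse(estimates_a: Sequence[float], estimates_b: Sequence[float]) -> float:
    """Root mean squared error between two paired sets of proportions."""
    if len(estimates_a) != len(estimates_b) or len(estimates_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_diffs = [(a - b) ** 2 for a, b in zip(estimates_a, estimates_b)]
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))
```

An RMSE of 0 indicates identical estimates, and larger values indicate greater disagreement on the same scale as the ancestry proportions themselves.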
LAI accuracy
Local ancestry calls were generated by RFMix using the same parameters as described in the previous section. The ability to correctly assign local ancestry was calculated at an individual level in two ways. The first determined the global accuracy using the formula $A = C / N$, where $N$ is the number of sites that had a called ancestry and $C$ is the number of sites that had a correctly called ancestry as compared to the simulations. The second method of accuracy estimation looked at this accuracy per ancestral population using the formula $A_k = C_k / N$, where $N$ is the number of sites that had a called ancestry and $C_k$ is the number of sites at which the specific ancestry $k$ was correctly called (26). These accuracy estimators were then averaged over all individuals in the simulated 5-way admixed dataset.
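The two accuracy estimators can be sketched for a single individual as follows. This is a minimal illustration under stated assumptions: the function names and the ancestry labels are hypothetical, uncalled sites are represented by None, and the per-ancestry denominator is taken literally as the total number of called sites. Restricting that denominator to sites of the given ancestry is an alternative reading that the text does not resolve.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def _called_pairs(
    called: Sequence[Optional[str]], truth: Sequence[str]
) -> List[Tuple[str, str]]:
    """Pair inferred and simulated ancestry at sites with a called ancestry."""
    pairs = [(c, t) for c, t in zip(called, truth) if c is not None]
    if not pairs:
        raise ValueError("no sites with a called ancestry")
    return pairs


def overall_accuracy(called: Sequence[Optional[str]], truth: Sequence[str]) -> float:
    """Correctly called sites divided by all called sites."""
    pairs = _called_pairs(called, truth)
    return sum(c == t for c, t in pairs) / len(pairs)


def per_ancestry_accuracy(
    called: Sequence[Optional[str]], truth: Sequence[str]
) -> Dict[str, float]:
    """Sites correctly called as each ancestry, divided by all called sites.

    Assumption: the denominator is the total number of called sites.
    """
    pairs = _called_pairs(called, truth)
    correct: Dict[str, int] = {}
    for c, t in pairs:
        if c == t:
            correct[t] = correct.get(t, 0) + 1
    return {k: correct.get(k, 0) / len(pairs) for k in sorted(set(truth))}


def mean_over_individuals(values: Sequence[float]) -> float:
    """Average an accuracy estimator over all simulated individuals."""
    return sum(values) / len(values)
```

For example, with inferred calls `["A", "A", "B", None, "B"]` and simulated ancestry `["A", "B", "B", "A", "B"]`, four sites are called and three of them are correct, giving an overall accuracy of 0.75.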