ALLELE FREQUENCIES FOR 20 AUTOSOMAL MICROSATELLITE LOCI IN THE KENYAN POPULATION

Samples from 180 unrelated persons of Kenyan descent, collected at a DNA testing facility in Nairobi, were genotyped using the PowerPlex® 21 STR kit to generate the first indigenous allele frequency table covering 20 autosomal STR loci for use in forensic analysis of human DNA in Kenya. Informed consent for use of the samples in this study was obtained, and de-identification procedures were employed in accordance with recommendations from the Scientific and Ethics Review Unit at the Kenya Medical Research Institute (KEMRI). The markers amplified for the generation of the allele frequency table were D3S1358, D13S317, Penta E, D16S539, D18S51, D2S1338, CSF1PO, Penta D, TH01, vWA, D21S11, D7S820, TPOX, D8S1179, FGA, D1S1656, D5S818, D6S1043, D12S391, and D19S433. A high degree of gene diversity was observed in this population, with average polymorphism information content (PIC) and heterozygosity values of 0.799 and 0.831 respectively across the 20 loci. Cumulatively, 182 alleles were detected in the Kenyan population analysed across the 20 STR loci. The lowest allele frequency was 0.003, corresponding to a single observation of an allele, while the highest allele frequency was 0.36 for allele 16 at marker D3S1358. PIC values ranged from 0.69 to 0.90, with Penta E returning the highest score. The high PIC scores show that the marker set, including the additional markers, is highly informative. The power of discrimination ranged from 89% to 97% with a combined power of discrimination of 99.99%. The combined match probability, the chance that an unrelated person picked at random from the general population has a genotype identical to that derived from the reference sample or the evidence, was 4.34 × 10^-26.
The dataset generated in the present study is highly effective at discriminating between individual genotypes and greatly increases the power of discrimination available to Kenyan forensic DNA testing facilities. The loci included in this dataset comprise those commonly used in the US, Europe and Asia, and the development of this allele frequency table increases the data sharing possibilities between local and international forensic DNA testing facilities.


Introduction
Analysis of human DNA for forensic testing, family relationship testing and identification purposes is currently based on short tandem repeat (STR) markers, also called microsatellites. These DNA sequences, with a repeat unit of two to seven base pairs, are used to generate unique DNA profiles composed of genotypes from separate genetic loci. No two individuals possess the same autosomal DNA profile except identical twins. Allele frequency data gathered from a relevant population are used to determine the statistical probability of a random match between two individuals, as well as to calculate probabilities of relationship between two individuals in question [1].
There are three possible outcomes of a comparison of two STR genotypes in the examination of crime scene evidence: (a) exclusion, where there is no match between the two genotypes, (b) an inconclusive result, where the relationship between the two profiles is difficult to interpret, or (c) inclusion, where the two STR genotypes in question match [2]. Of the three conclusions, only the last requires statistical interpretation of the result. The purpose of incorporating statistics is to show, by way of a probability estimate, that the match observed is unlikely to be a chance event [1]. The match statistics used are usually provided as estimates of the random match probability or the frequency of the particular genotype in a given population [3]. There are, however, different approaches that may be taken in stating the rarity of a match between an unknown and a known sample [1].
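As an illustration of how a genotype frequency feeds into a random match probability, the following minimal sketch applies the product rule. It assumes Hardy-Weinberg proportions within each locus and independence between loci, and it omits any subpopulation correction. The function names and the allele frequencies are hypothetical and are not taken from the study data.

```python
from math import prod

def genotype_frequency(p, q=None):
    """Expected genotype frequency under Hardy-Weinberg proportions.

    One allele frequency -> homozygote (p^2);
    two allele frequencies -> heterozygote (2pq).
    """
    if q is None:
        return p * p
    return 2 * p * q

def random_match_probability(per_locus_frequencies):
    """Product rule: multiply genotype frequencies across independent loci."""
    return prod(per_locus_frequencies)

# Hypothetical three-locus profile (illustrative values only).
profile = [
    genotype_frequency(0.36),        # homozygote at locus 1
    genotype_frequency(0.20, 0.10),  # heterozygote at locus 2
    genotype_frequency(0.15, 0.05),  # heterozygote at locus 3
]
rmp = random_match_probability(profile)
print(f"Random match probability: {rmp:.3e}")
```

Each additional independent locus multiplies the profile frequency by a further factor below one, which is why adding markers lowers the match probability so quickly.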
The only Kenyan allele frequency table currently available has 15 markers, comprising the D8S1179, D21S11, D7S820, CSF1PO, D3S1358, TH01, D13S317, D16S539, D2S1338, D19S433, vWA, TPOX, D18S51, D5S818, and FGA loci, derived from an older 5-dye STR marker kit [4]. However, recent advances in commercial STR kit development have seen the addition of more STR markers. One of the commercial STR kits in use, PowerPlex 21 (Promega, Madison, USA), comprises 20 markers, providing additional genotyping data and significantly increasing the power of discrimination available to a forensic analyst when calculating paternity and kinship statistics. This study aimed to genotype five additional markers, D1S1656, D6S1043, D12S391, Penta D, and Penta E, and to provide a corresponding allele frequency table, to enable utilization of all genotype data generated by the newer 20-marker STR amplification kits.
When testing the efficiency of added markers, determination of heterozygosity, defined as the condition at a locus of having two different alleles, is vital [5]. The higher the heterozygosity exhibited, the greater the ability to distinguish samples from different individuals. In paternity testing, the paternity index (PI) is a likelihood ratio that compares the probability that the alleged father contributed the paternal alleles observed with the probability that an untested, unrelated, random man in the population did so. An alternative statistic used in paternity testing is the power of exclusion, also referred to as the probability of exclusion (PE), which is the probability that a random, unrelated man in the population would be excluded as the biological father [6]. The power of exclusion informs the utility of the marker panel used in excluding particular genotypes [7].
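The sketch below shows how observed heterozygosity can be computed from genotypes, together with two closed-form approximations that are widely used for a locus-level power of exclusion and a typical paternity index. Whether PowerStats applies exactly these expressions is an assumption made here for illustration, and the function names are hypothetical.

```python
def observed_heterozygosity(genotypes):
    """Fraction of individuals carrying two different alleles at a locus."""
    heterozygotes = sum(1 for a, b in genotypes if a != b)
    return heterozygotes / len(genotypes)

def power_of_exclusion(h):
    """Approximation PE = h^2 * (1 - 2*h*H^2), with H = 1 - h (homozygosity)."""
    H = 1.0 - h
    return h ** 2 * (1.0 - 2.0 * h * H ** 2)

def typical_paternity_index(h):
    """Approximation TPI = 1 / (2H), with H = 1 - h (homozygosity)."""
    H = 1.0 - h
    if H == 0:
        raise ValueError("typical PI is undefined when no homozygotes are observed")
    return 1.0 / (2.0 * H)

# Hypothetical genotypes at one locus (allele labels are illustrative).
genotypes = [("7", "7"), ("7", "9"), ("8", "9"), ("6", "9.3")]
h = observed_heterozygosity(genotypes)
print(h, power_of_exclusion(h), typical_paternity_index(h))
```

Both approximations depend only on heterozygosity, which is why loci with more heterozygotes score higher on every paternity statistic.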
The power of discrimination (PD) is a measure of the probability that two randomly selected individuals have different genotypes; it is the complement of the random match probability, the chance that two randomly selected individuals have matching genotypes (Planz, 2004). The power of discrimination indicates how powerful the loci are at individualization [7]. Finally, the polymorphism information content (PIC) of a marker is a measure of the marker's usefulness for linkage analysis between parent and offspring [8]. The PIC value is used in genetics as a common measure of polymorphism at a locus used in linkage analysis [9]. These statistical parameters were used in the present study to determine the utility of the new 20-marker set.
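For concreteness, the following sketch computes PD as one minus the sum of squared observed genotype frequencies, and PIC from allele frequencies using the standard expression PIC = 1 - sum(p_i^2) - sum over i<j of 2 p_i^2 p_j^2. The function names and the example data are hypothetical; this is an illustration of the definitions, not the software used in the study.

```python
from collections import Counter
from itertools import combinations

def power_of_discrimination(genotypes):
    """PD = 1 - sum of squared observed genotype frequencies at a locus."""
    n = len(genotypes)
    # Sort alleles within a genotype so (a, b) and (b, a) count as the same genotype.
    counts = Counter(tuple(sorted(g)) for g in genotypes)
    match_probability = sum((c / n) ** 2 for c in counts.values())
    return 1.0 - match_probability

def polymorphism_information_content(allele_freqs):
    """PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2."""
    freqs = list(allele_freqs)
    sum_squares = sum(p * p for p in freqs)
    cross_terms = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - sum_squares - cross_terms

# Hypothetical data for a two-allele locus.
genotypes = [(1, 2), (2, 1), (1, 1), (2, 2)]
print(power_of_discrimination(genotypes))
print(polymorphism_information_content([0.5, 0.5]))
```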

Study population
The samples used in this study were collected and archived over a period of 5 years from persons visiting the forensic lab for kinship and paternity testing. Informed consent was obtained for use of the samples for extended population studies with de-identification, and the protocol was approved following ethical review by the Scientific and Ethics Review Unit-KEMRI. The source population was heterogeneous in ethnic composition and was therefore presumed to be representative of the diversity present across the Kenyan nation. The age and gender of the study participants were not considered during sample selection as these do not influence the allele frequency data. Critically, only samples from confirmed unrelated individuals were included in this study.

Sample collection
Source DNA was collected from the oral cavity of the donor using sterile foam-tipped swabs (COPAN™ Diagnostics Inc).
Briefly, the swab was rubbed on the inner lining of the cheek and under the tongue to collect cheek cells with saliva as the transfer medium. The cells were then transferred from the swab onto indicating nucleic acid cards (COPAN™ Diagnostics Inc) by firmly pressing the swab onto the card for about 30 seconds. Proper sample transfer from the swab onto the card was achieved when the card turned from pink to white. The cards were then allowed to air dry for about 30 minutes, packaged into individual labelled envelopes and sealed pending DNA profiling. The swabs used to collect the DNA sample were then discarded as bio-hazardous waste.

PCR amplification
The DNA collected on the nucleic acid cards was genotyped using the PowerPlex® 21 STR kit (Promega, Madison, USA). Briefly, a 1.2 mm punch from the nucleic acid card was added to a sterile PCR tube containing 25 µl of reconstituted amplification mix, prepared by mixing 5 µl of labelled primer mix, 5 µl of master mix and 15 µl of nuclease-free water. Positive and negative controls were included in each run to verify the integrity of the PCR reaction as per the manufacturer's instructions. The PCR amplification was conducted on a Veriti™ thermocycler (Applied Biosystems™) with the following thermal cycling protocol: 96°C for 1 minute; 25 cycles of 94°C for 10 seconds, 59°C for 1 minute and 72°C for 30 seconds; then 60°C for 20 minutes and a final hold at 4°C.
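The thermal cycling protocol can be written down as a small data structure, which makes the nominal run length easy to check. The sketch below reads the protocol as an initial denaturation, 25 cycles of three steps, and a final extension; that grouping is an interpretation of the text, and the constant and function names are hypothetical. Ramp times and the 4°C hold are not included.

```python
# Each step is (temperature in degrees C, hold time in seconds).
INITIAL_DENATURATION = (96, 60)
CYCLE_STEPS = [(94, 10), (59, 60), (72, 30)]  # denature, anneal, extend
N_CYCLES = 25
FINAL_EXTENSION = (60, 20 * 60)

def programmed_hold_seconds():
    """Sum of programmed hold times, excluding ramping and the final 4 C hold."""
    per_cycle = sum(seconds for _, seconds in CYCLE_STEPS)
    return INITIAL_DENATURATION[1] + N_CYCLES * per_cycle + FINAL_EXTENSION[1]

total = programmed_hold_seconds()
print(f"Programmed hold time: {total} s (about {total / 60:.1f} min)")
```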

Fragment analysis
The amplified DNA samples were separated by capillary electrophoresis on a 3500 Genetic Analyzer (Applied Biosystems™). A loading cocktail was prepared on a 96-well plate by combining 0.5 µl of WEN ILS 500 size standard (Promega, Madison, USA) and 9.5 µl of highly deionized formamide, to which 1 µl of amplified DNA was added. The samples were then centrifuged briefly to remove air bubbles. This was followed by sample denaturation at 95°C for 3 minutes and then immediate chilling on crushed ice. In order to accurately determine the sample genotypes, an allelic ladder was included in addition to the WEN ILS 500. The allelic ladder was used to assign alleles to the unknown DNA fragments according to the number of repeats present, while the WEN ILS 500 was used to determine the size of the unknown DNA fragments in base pairs. The raw data obtained from the 3500 Genetic Analyzer were processed using GeneMapper® ID-X software version 1.3 (Applied Biosystems™) to generate individual DNA profiles for each sample analysed. The analysed data were exported to MS Excel, which enabled import into PowerStats v1.2 software (Promega, Madison, USA) for the generation of allele frequencies.

Data collection procedure
Autosomal STR profiles from the 180 individuals were used to generate an allele frequency table for the 20 markers using PowerStats v1.2 software (Promega, Madison, USA). The STR genotypes were converted into observed allele frequencies following the procedure outlined earlier [4]: the number of times each allele was observed across all samples was counted and divided by the total number of chromosomes tested, which is two per individual (2n = 360 for the 180 individuals). PowerStats software was also used to calculate the observed heterozygosity (Ho), power of discrimination (PD), probability of exclusion (PE), and polymorphism information content (PIC).
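The counting procedure can be sketched in a few lines. This is an illustration of the arithmetic only, not the PowerStats implementation; the function name and the genotypes are hypothetical.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Count each allele and divide by the number of chromosomes (2 per individual)."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    chromosomes = 2 * len(genotypes)
    return {allele: count / chromosomes for allele, count in sorted(counts.items())}

# Hypothetical genotypes at a single locus for three individuals.
genotypes = [("15", "16"), ("16", "16"), ("16", "17")]
freqs = allele_frequencies(genotypes)
for allele, frequency in freqs.items():
    print(allele, round(frequency, 3))
```

A homozygous individual contributes two copies of the same allele, so the frequencies at a locus always sum to one.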
There were cases where some alleles were not sampled sufficiently and the allele was so rare that it was represented only once or a few times in a dataset. In such cases, it is recommended that only alleles observed at least five times be used in forensic calculations [10]. The minimum allele frequency is therefore determined as 5/(2n) where n is the number of individuals sampled and 2n is the number of chromosomes since autosomes are in pairs due to inheritance of one chromosome from each parent.
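As a worked check of the 5/(2n) rule, the sketch below computes the minimum allele frequency for a given sample size and raises any rarer observed frequency to that floor. The function names and the example frequencies are hypothetical.

```python
def minimum_allele_frequency(n_individuals, min_count=5):
    """Minimum allele frequency 5/(2n): five observations among 2n chromosomes."""
    return min_count / (2 * n_individuals)

def apply_minimum(frequencies, n_individuals):
    """Replace any frequency below the minimum with the minimum itself."""
    floor = minimum_allele_frequency(n_individuals)
    return {allele: max(f, floor) for allele, f in frequencies.items()}

n = 180
floor = minimum_allele_frequency(n)
print(round(floor, 4))

# An allele seen once among 2n = 360 chromosomes is raised to the floor.
observed = {"rare": 1 / 360, "common": 0.36}
adjusted = apply_minimum(observed, n)
print(adjusted)
```

After this adjustment the frequencies at a locus no longer sum exactly to one, which is an accepted consequence of the conservative floor.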
Allele independence of the resulting population data was assessed using a Hardy-Weinberg equilibrium (HWE) test. HWE describes the stability of allele and genotype frequencies from one generation to the next [10]. GenAlEx version 6.5 was used for HWE testing and to conduct the exact test of population differentiation [11].
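One common way to test a locus for departure from HWE is a chi-square goodness-of-fit comparison of observed and expected genotype counts. The sketch below shows that calculation under the assumption of a single locus with k alleles; it is not a reproduction of the GenAlEx procedure, and the function name and data are hypothetical. For STR loci with many rare alleles, expected counts can be very small, so exact tests are often preferred.

```python
from collections import Counter
from itertools import combinations_with_replacement

def hwe_chi_square(genotypes):
    """Return (chi-square statistic, degrees of freedom) for one locus."""
    n = len(genotypes)
    observed = Counter(tuple(sorted(g)) for g in genotypes)
    allele_counts = Counter(a for g in genotypes for a in g)
    freqs = {a: c / (2 * n) for a, c in allele_counts.items()}

    chi2 = 0.0
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        # Expected count: n * p^2 for homozygotes, n * 2pq for heterozygotes.
        proportion = freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
        expected = n * proportion
        chi2 += (observed.get((a, b), 0) - expected) ** 2 / expected

    k = len(freqs)
    # Genotype classes k(k+1)/2, minus 1, minus (k-1) estimated allele frequencies.
    degrees_of_freedom = k * (k - 1) // 2
    return chi2, degrees_of_freedom

# Hypothetical sample that sits exactly at Hardy-Weinberg proportions.
genotypes = [(1, 1), (1, 2), (2, 1), (2, 2)]
chi2, dof = hwe_chi_square(genotypes)
print(chi2, dof)
```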

Results
A total of 180 samples were used in this study to generate allele frequencies across 20 microsatellite markers. A total of 182 alleles were detected in the Kenyan population analysed across the 20 STR loci (Table 1). The lowest allele frequency was 0.003, corresponding to a single observation of an allele, while the highest allele frequency was 0.36 for allele 16 at marker D3S1358 (Table 1). Since it has been recommended that each allele should be observed a minimum of five times for it to be included in accurate and reliable statistical calculations (Statistics, 2013), the lowest observed allele frequency of 0.003 was replaced with the minimum allele frequency of 0.0139 in this population.
The most common alleles among the 20 STR loci typed were allele 7 for TH01, allele 8 for TPOX, allele 10 for Penta D and [...] (Figure 1). Overall, allele 16 at locus D3S1358 was the most common in the population sampled (Table 1). The total number of alleles found in the population tested was 182. [...] (Table 2). None of the 20 STR loci were in linkage disequilibrium. [...] (Table 3). The high PIC scores, ranging from 0.69 to 0.90, reflect the high informative value of the genetic markers in this data set for linkage studies. The power of discrimination ranged from 89% to 97% with a combined power of discrimination of 99.99%. The combined match probability, the chance that an unrelated person picked at random from the general population has a genotype identical to that derived from the reference sample or the evidence, was 4.34 × 10^-26 (Table 3). Together with the high power of discrimination, this demonstrates the utility of the dataset in accurately distinguishing among individuals in the population. The combined paternity index was 2.59 × 10^10, which indicates how many times more likely it is that the alleged father in a paternity test is the biological father as opposed to a randomly selected unrelated man from a similar ethnic background. The loci with the highest observed heterozygosity were D18S51 and Penta D, with a heterozygosity value of 92%, while locus D3S1358 had the lowest heterozygosity of 69%. The mean heterozygosity of the 20 STR loci was 83%. The high degree of observed heterozygosity corresponds to high allelic diversity and reduces the likelihood of a random match between sample genotypes.
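The combined statistics follow from the per-locus values by multiplication, assuming the loci are independent. The sketch below shows one way to compute them; the function names and per-locus values are hypothetical and are not the values in Table 3.

```python
from math import prod

def combined_match_probability(pd_values):
    """Product of per-locus match probabilities, each taken as 1 - PD."""
    return prod(1.0 - pd for pd in pd_values)

def combined_power_of_discrimination(pd_values):
    """Complement of the combined match probability."""
    return 1.0 - combined_match_probability(pd_values)

def combined_paternity_index(pi_values):
    """Product of per-locus paternity indices."""
    return prod(pi_values)

# Hypothetical per-locus values for a two-locus panel.
pd_values = [0.90, 0.90]
pi_values = [2.0, 3.5]
print(combined_match_probability(pd_values))
print(combined_power_of_discrimination(pd_values))
print(combined_paternity_index(pi_values))
```

With 20 loci each having a PD between 0.89 and 0.97, the product of the per-locus match probabilities becomes extremely small, which is consistent in order of magnitude with the combined match probability reported above.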

Ethics approval and consent to participate
Informed consent was obtained for use of the samples for extended population studies with strict de-identification procedures and the protocol was approved following ethical review by the Scientific and Ethics Review Unit-KEMRI under protocol reference KEMRI/SERU/CBRD/198/3897.

Consent for publication
Not applicable.

Availability of data and materials
The datasets used and analysed during the current study are available from the corresponding author on reasonable request.

Competing Interests
The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.

Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Authors' contributions
CW was extensively involved in proposal development, data collection and analysis as well as preparation of the manuscript. EA significantly assisted in data analysis and manuscript development. WC assisted in data analysis and manuscript development. JK broadly assisted in proposal preparation and data analysis and interpretation. EL provided oversight in proposal development, data analysis, interpretation and preparation of the manuscript.