Genetic Polymorphisms and Forensic Efficiency of 16 X-chromosomal STR Loci for Sri Lankan Population

A new multiplex PCR system of 16 X-chromosomal short tandem repeats (X-STRs) has recently been developed for Sri Lankans, though its applicability in evolutionary genetics and forensic investigations has not been thoroughly assessed. In this study, 838 unrelated individuals covering all four major ethnic groups in Sri Lanka (Sinhalese, Sri Lankan Tamils, Indian Tamils and Moors) were successfully genotyped using this new multiplex system. The results indicated a high forensic efficiency for the tested loci in all four ethnicities, confirming its suitability for forensic applications among Sri Lankans. The allele frequency distribution of Indian Tamils showed subtle but statistically significant differences from those of Sinhalese and Moors, in contrast to frequency distributions previously reported for autosomal STR alleles. This suggests a sex-biased demographic history among Sri Lankans and calls for a separate X-STR allele frequency database for Indian Tamils. Substantial differences observed in the patterns of LD among the four groups demand the use of separate haplotype frequency databases for each ethnicity. When analysed together with 14 other world populations, all Sri Lankan ethnicities except Indian Tamils clustered closely with populations from the Indian Bhil tribe, Bangladesh and Europe, reflecting their shared Indo-Aryan ancestry.


Introduction
Sri Lanka is an island country in South Asia, located in the Indian Ocean, close to India. Due to its strategic position in the middle of the maritime silk route from China to Europe, it has been well known to the outside world as a trading hub since ancient times. The diverse ethnicities that compose the 20 million population inhabiting the island as per the last population census 1 have descended mainly from numerous groups of migrants who came to the island at various historical time periods. Their overpowering impact has confined the original inhabitants of the country to a few dry zone areas, forming a tribal group known as the Veddahs (aboriginals) 2 , represented by about 10,000 individuals 3 .
Today, the Sinhalese make up the largest ethno-cultural group in Sri Lanka, with a population of 15.17 million (74.9% of the total population) 1 . The Sinhalese are unique in the world as the only ethnic group that speaks Sinhala, a branch of the Indo-European (Indo-Aryan) language family 4 . According to historical chronicles, the Bengali prince Vijaya and his seven hundred followers, who were descendants of the Indo-Aryan natives of the northern Indian subcontinent, laid the foundation of the Sinhalese in 543 BCE 4 . They vanquished the aboriginal Veddahs and made Sri Lanka a Sinhalese territory until the Dravidian rulers from South India invaded the northern part of the island in the fifth century AD 4 . Since then, there have been frequent migrations by South Indians into the country, which gave rise to the second largest ethnic group in Sri Lanka, known as the "Ceylon Tamil" (2.27 million people, representing 11.2%) 1 .
A third ethnic group was established in the country when Arab traders visiting the country for commercial purposes settled in Sri Lanka in 1000 AD, leading to intermarriages with the Sinhalese and the Sri Lankan Tamils 4 . The group now known as Moors comprises 9.2% 1 (1.86 million) of the Sri Lankan population and maintains unique sociocultural features which are based largely on the Islamic faith. They speak a Dravidian language containing a large number of Arabic words, generally referred to as "Arabic Tamil" 5 . However, some scholars attribute the origin of Moors to South Indian traders, who later settled in Sri Lanka 5 . This view is based in part on the similarities shared by Sri Lankan Moors with the Tamil Muslims of Tamil Nadu. Indian Tamils make up the fourth largest ethnic group in Sri Lanka and comprise the descendants of plantation workers brought to Sri Lanka from South India by the English rulers who colonized Sri Lanka in the 19th century 4 . Comprising a relative minority of 0.84 million people (4.2% of the total population) 1 , Indian Tamils are chiefly confined to the central hills of Sri Lanka, with relatively low admixture with other ethnic groups due to sociocultural reasons associated with their more recent immigrant status.
In addition to these ethnicities, around 0.5% of the Sri Lankan population comprises other minor ethnic groups of various descents. They include Malays (descendants from the island of Java), Burghers (descendants of colonists from Portugal, the Netherlands and the UK) and other Chinese and African migrants who came to the island in the 18th and 19th centuries.
Because of this demographic history and gene flow, the genetic position occupied by each of these ethnic groups, at both local and global scales, is not clear. The wide array of genetic markers currently available provides an opportunity for reliably elucidating their genetic affinities. However, depending on the type of genetic markers used, the ancestral information of populations deduced from the analyses could be quite different. On the other hand, these different approaches can complement each other with their characteristic genetic information. Although a few previous studies have been conducted to understand the genetic substructure and underlying heterogeneity of Sri Lankan ethnicities using autosomal 6 , Y chromosomal 7 and mitochondrial markers 8 , none has utilized those present on the X chromosome. X-chromosome markers, particularly short tandem repeats (STRs), which combine the advantageous features of both autosomal and uniparental markers, play an important role in evolutionary studies 9,10 as well as in forensic genetics 11 . They assist in the interpretation of complex kinship cases on their own or in conjunction with other markers such as autosomal STRs. Analysis of X chromosome STRs (X-STRs) is especially advantageous in complex cases where at least one female is involved, such as a deficient paternity case of a female child, cases involving female siblings sharing a common biological father, and questioned relationships between paternal grandmother and granddaughter or other distant female relatives 12 . They are also of special importance in forensic case work in which female traces are to be identified in male background contamination 12,13 . The use of clusters of tightly linked X-STRs forming highly informative haplotypes is particularly effective in such cases 13 . In addition, X-chromosome markers have a proven utility in tracing the sex-biased demographic history of populations with complex admixture and gene flow patterns.
Consequently, X-chromosome markers have gained significant importance in population and forensic genetic studies in the past two decades. However, the routine practice of molecular genetics in Sri Lanka presently comprises only autosomal, Y-chromosomal and mitochondrial DNA analyses. The scope of X-chromosome markers is yet to be investigated for Sri Lankans.
Recognizing this vital need in the field of molecular genetics in Sri Lanka, we recently developed a multiplex X-STR system with 16 X-STR markers distributed from 9.198 Mb to 149.460 Mb of the X chromosome 14 , with the aim of incorporating X-STR analysis into molecular forensic and evolutionary genetics investigations in the country. Thirteen of these STR markers are in four closely linked clusters (each spanning < 3 cM) that are likely to produce stable haplotypes (Cluster I: DXS10148-DXS10135-DXS8378 (Xp22); Cluster II: DXS7132-DXS10079-DXS10074-DXS10075 (Xq12); Cluster III: DXS6801-DXS6809-DXS6789 (Xq21); Cluster IV: DXS7424-DXS101-DXS7133 (Xq22)). Three additional unlinked markers were also included from both the p arm (DXS9902 at Xp22) and the q arm (HPRTB at Xq26 and DXS7423 at Xq28) to give better coverage of the X chromosome. The assay was validated for the Sinhalese using 200 unrelated individuals, of which 120 were males. In the present study, we extended the analysis to evaluate the forensic efficiency of this novel 16 X-STR assay in all four major ethnicities in Sri Lanka (Sinhalese, Sri Lankan Tamils, Indian Tamils and Moors) and constructed an allele and haplotype frequency database for Sri Lankans for forensic and kinship analysis purposes. Further, we conducted a comprehensive analysis of the possible linkage disequilibrium (LD) among the selected X-STR markers, which is essential for drawing valid conclusions on existing relationships. Here, we report for the first time the X-STR based population genetic information of all four main ethnicities in Sri Lanka. In addition, pairwise genetic distances based on Fst were calculated between the Sri Lankan population and populations from South, Southeast and East Asia, Europe, Africa and Brazil, based on data extracted from the literature, to elucidate the genetic substructure between Sri Lankan and other global populations.

Polymorphism
We typed 16 X-STR loci for 838 unrelated individuals of the Sri Lankan population covering the four major ethnicities. Complete DNA profiles were obtained for all male samples without any allele dropouts. Altogether 203 alleles were observed across the four populations, with 30 private alleles (Sinhalese: total alleles (A_T) = 184, private alleles (A_p) = 17; Sri Lankan Tamils: A_T = 164, A_p = 5; Indian Tamils: A_T = 156, A_p = 5; Moors: A_T = 150, A_p = 3). When female samples were tested for conformity to Hardy-Weinberg equilibrium, no significant deviations were observed for any of the tested loci after adjusting for multiple comparisons (Bonferroni corrected P = 0.0031) (Supplementary Table S1). The exact test of population differentiation did not detect significant differences in allele distribution between the male and female samples, and hence the allele frequencies were combined for both sexes for further analysis. [...] from 0.3548 to 0.8803. Among all the tested loci, DXS10135 showed the highest value for all forensic parameters, suggesting it to be the most informative marker for Sri Lankans, while DXS7423 showed the lowest values and was the least informative. It is also interesting to note that the least polymorphic marker, DXS7423, had two of its alleles (alleles 14 and 15) present in about 80% of the studied samples. Nevertheless, the combined power of discrimination for the 16 tested loci in both males (CPD_m > 0.999999999958569) and females (CPD_f = 1.000000000000000), as well as the combined MEC indices calculated for deficiency (CMEC_Kru > 0.999998849524199), normal trio (CMEC_Des-trio > 0.999999999678351) and duo cases (CMEC_Des-duo > 0.999999800357383), were equally high for all four ethnicities (Table 9a).
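The per-locus and combined power of discrimination reported above can be illustrated with a minimal sketch. It assumes the formulas commonly used for X-chromosomal markers (male PD = 1 - Σp², female PD = 1 - 2(Σp²)² + Σp⁴); the text does not state which formulas or software were used, and the allele frequencies below are hypothetical, not taken from the study. Combining loci by multiplication further assumes independence between loci, which does not hold for markers in LD.

```python
from math import prod


def pd_male(freqs):
    """Power of discrimination in males (hemizygous): 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs)


def pd_female(freqs):
    """Power of discrimination in females: 1 - 2*(sum p_i^2)^2 + sum p_i^4."""
    s2 = sum(p * p for p in freqs)
    s4 = sum(p ** 4 for p in freqs)
    return 1.0 - 2.0 * s2 * s2 + s4


def combined_pd(per_locus_pd):
    """Combined PD over loci assumed to be independent: 1 - prod(1 - PD_i)."""
    return 1.0 - prod(1.0 - pd for pd in per_locus_pd)


# Hypothetical locus with two equally frequent alleles (illustration only).
example_freqs = [0.5, 0.5]
print(pd_male(example_freqs))    # 0.5
print(pd_female(example_freqs))  # 0.625
print(combined_pd([0.5, 0.5]))   # 0.75
```

Because each additional independent locus multiplies the residual match probability, combined values approach 1 very quickly, which is why the combined indices above are reported to many decimal places.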
Table 9 Combined forensic efficiency parameters calculated for the 16 X-STR loci in the four ethnic groups based only on allele frequency data (a) and on both allele frequency and haplotype frequency data for those loci in LD (b)
For those loci which were found to be in LD, it is more appropriate to calculate the forensic efficiency based on the observed haplotype frequencies instead of individual allele frequencies, since each haplotype is assumed to behave as an allele. Accordingly, forensic efficiency was recalculated for the X-STR loci for those ethnicities which exhibited LD (see section below on LD and haplotypes) based on their particular haplotype frequencies (Table 10). The combined forensic efficiency parameters for the 16 X-STRs were also recalculated based on the efficiency of both individual markers and haplotypes (for those markers in LD). For this purpose, the 16 markers were treated as 12 loci for Sinhalese, while they were treated as 15 loci for Indian Tamils and Moors. For Sri Lankan Tamils, recalculation was not necessary as no LD was detected within the 16 loci tested. As shown in [...] (Table 11) to understand the extent of genetic differentiation among them. Of the four populations, Sinhalese are known to have an Indo-Aryan origin, which is different from the Dravidian linguistic origin of the other three ethnicities. However, no significant variation was detected between the two linguistic groups (Fct = -0.00059; P > 0.05), though a subtle but statistically significant variation was detected among populations within groups (Fsc = 0.0018; P = 0.0108). The global AMOVA results as a weighted average over loci showed that most of the variance in the samples is attributable to within-individual variation (97.97%), while between ethnic group variation is around 0.18%. To better understand this observed population structure, pairwise comparisons (pairwise Fst analysis) were carried out among all four ethnicities (Table 12).
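To make the idea of a pairwise Fst concrete, the following is a simplified sketch for a single locus and two populations. It uses the heterozygosity-based definition Fst = (H_T - H_S) / H_T with equal population weights; this is an assumption made for illustration and is not the AMOVA-based estimator behind the values reported here. The function names and allele frequencies are hypothetical.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity 1 - sum(p_i^2) for a dict {allele: frequency}."""
    return 1.0 - sum(p * p for p in freqs.values())


def pairwise_fst(pop_a, pop_b):
    """Simplified single-locus Fst = (H_T - H_S) / H_T with equal weights."""
    alleles = set(pop_a) | set(pop_b)
    # Allele frequencies of the pooled (total) population.
    pooled = {a: (pop_a.get(a, 0.0) + pop_b.get(a, 0.0)) / 2.0 for a in alleles}
    h_s = (expected_heterozygosity(pop_a) + expected_heterozygosity(pop_b)) / 2.0
    h_t = expected_heterozygosity(pooled)
    if h_t == 0.0:
        return 0.0  # both populations fixed for the same allele
    return (h_t - h_s) / h_t


# Hypothetical frequencies (illustration only, not study data).
pop_1 = {"14": 0.40, "15": 0.40, "16": 0.20}
pop_2 = {"14": 0.35, "15": 0.45, "16": 0.20}
print(pairwise_fst(pop_1, pop_2))  # small positive value for similar populations
```

Values close to zero, such as those reported between the Sri Lankan groups, indicate that nearly all variation lies within populations rather than between them; significance would still need a permutation test, which this sketch does not include.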
According to the Fst values obtained, Indian Tamils were shown to have a subtle but statistically significant genetic subdivision from Sinhalese (Fst = 0.0029; P = 0.0000) and Moors (Fst = 0.0038; P = 0.0000), while Sinhalese, Sri Lankan Tamils and Moors are highly panmictic (P > 0.05). Further, the two Tamil ethnicities were shown to share a common genetic background (P > 0.05). Since phylogenetic trees constructed from genetic distances can be used to deduce the evolutionary relationships and origins of different populations 15,16 , the UPGMA method was applied to the four ethnicities to further evaluate their genetic affinities. As shown in the phylogram (Fig. 1), Sinhalese and Moors are genetically closely associated with each other and also, to a lesser extent, with Sri Lankan Tamils. Although the Indian Tamil group is placed at a distant position from Moors and Sinhalese, the close genetic affinity between the two Tamil groups is apparent in the phylogram. Nei genetic distances for the six pairs of ethnic groups are listed in Table 13. These findings agree with the historical data on early settlement of the four ethnic groups in Sri Lanka. According to anthropological and archaeological evidence, Sri Lankan Tamils have a very long history in Sri Lanka and have lived on the island since at least around the second century BCE. They arrived in Sri Lanka from various parts of the Indian subcontinent, either with their original families or alone, and subsequently united with the Sinhalese through matrimonial bonds. Indian Tamils, on the other hand, were brought to Sri Lanka to work on estates during the British colonization and had minimal admixture with the native Sinhalese or with Sri Lankan Tamils, who were more economically independent at the time and had better social status. Even at present, the majority of Indian Tamils are congregated around the plantation estates of the central hill area of the country, forming a separate community.
In contrast, Sri Lankan Moors have descended exclusively from Muslim male merchants of either Arabic or Indian origin 5 , who came to Sri Lanka for trading. During the 14th century, they started to settle in coastal areas of Sri Lanka and married local women, who were either Sinhalese or Sri Lankan Tamil. Thus, given that the X chromosome spends two thirds of its time in females, it is not surprising to see this local female ancestry reflected among Moors in our X-STR analysis, despite their Arabic origin. Alternatively, the genetic similarity observed between Sri Lankan Tamils and Moors may reflect the Indian origin of Moors, as some scholars claim 5 . Likewise, the genetic similarity observed between the two Tamil populations might also lie in their common Indian origin.
Since the X chromosome reflects more of the maternal bloodline, results reported in mitochondrial (mt) DNA studies are also of much relevance to the present context. These results suggest that the allele distribution of the four ethnic groups of Sri Lanka is very similar to that of the two tested Indian populations and the Bangladesh population. The Pakistani population also exhibits substantial similarity to the Sri Lankan ethnic groups, though not to the same extent as the Indian and Bangladesh populations. In contrast, the allelic distribution of many X-STR loci in Sri Lankan ethnic groups differs from that of Southeast Asian, East Asian, European and African populations. Among them, East Asian and African populations are the most genetically distant from Sri Lankans.
To further clarify the relationship between Sri Lankans and the world populations, pairwise Fst values were averaged over eight of the 16 studied X-STR loci (DXS10148, DXS10135, DXS8378, DXS7132, DXS10079, DXS10074, HPRTB and DXS7423) and were represented in a multidimensional scaling (MDS) plot (Fig. 2) 43 . According to the results observed, Sri Lankans clustered together not only with Indians and Bangladeshis, but also with Europeans. Indian Tamils were placed towards the periphery of this main cluster, while Southeast Asians, East Asians and Africans were placed at a distance, outside the main cluster.
This arrangement in the MDS plot aligns well with the historical claims of population movements in Eurasia. Sinhalese are believed to be descended from Indo-Aryans, who set forth from the borders of the Caspian and Black Seas towards Europe and South Asia early in the third millennium BC. Accordingly, many scholars hold the view that, along with the Indo-Aryan language family, many European and South Asian populations of today share a common genetic background reflecting their common Bronze Age ancestors. Tamils, on the other hand, are believed to have descended from the indigenous people of the Indian subcontinent. However, Sri Lankan Tamils have admixed with Sinhalese for nearly two millennia, unlike the Indian Tamils, which might explain their relative positions in the MDS plot.

LD and Haplotype Analysis
Population studies of various countries have illustrated that LD is population specific 41,45,46 and is not necessarily present between markers in close physical proximity 47,48 . In the present study, Sri Lankan Tamils did not exhibit LD within any of the four clusters, while Sinhalese displayed LD within three of the four studied clusters after adjusting for multiple comparisons (Table 14). Further, in clusters I and IV, LD was detected only in the Sinhalese population, and only between a single pair of loci (DXS10135 & DXS8378 in cluster I and DXS7424 & DXS101 in cluster IV). Among these markers, DXS10135 & DXS8378 of cluster I are separated by 13.1 kb. However, no LD was detected between DXS10148 & DXS10135 (of the same cluster), which are located only 1 kb apart. Cluster II, on the other hand, which showed the most LD among the four clusters, displayed a highly significant (P = 0.0000) LD between two of its adjacent markers, DXS10074 and DXS10075, in three ethnicities (Sinhalese, Indian Tamils and Moors). In addition, LD was also detected between DXS7132 and DXS10075, the outermost two of the four markers of cluster II, among the Sinhalese (P = 0.0000). In contrast, in cluster III, none of the markers showed a significant LD (corrected P = 0.0167), although a marginal LD (P = 0.0174) was detected between DXS6801 and DXS6809 for the Sinhalese population. [...] DXS10079 and DXS10074). The marker DXS10075, for which LD with DXS7132 was detected in Sinhalese, was not investigated in any of these populations, so no comparison can be made. However, it is noteworthy that all these studies were conducted with between 100 and 160 male samples, which might have limited the detection of true LD in these populations. On the other hand, LD was not observed for cluster III or cluster IV in a study that investigated 302 Pakistani males 20 , the only Asian population for which LD data are available for these clusters.
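The LD pairs listed above can be related to the number of effective loci used when recalculating the combined forensic parameters (12 for Sinhalese, 15 for Indian Tamils and Moors, 16 for Sri Lankan Tamils). The sketch below assumes that markers connected by significant LD are merged into a single haplotype block, so that DXS7132, DXS10074 and DXS10075 form one block in Sinhalese; this grouping rule is an assumption that happens to reproduce the reported counts, not a procedure stated in the text. The function name is illustrative.

```python
MARKERS = [
    "DXS10148", "DXS10135", "DXS8378",              # cluster I
    "DXS7132", "DXS10079", "DXS10074", "DXS10075",  # cluster II
    "DXS6801", "DXS6809", "DXS6789",                # cluster III
    "DXS7424", "DXS101", "DXS7133",                 # cluster IV
    "DXS9902", "HPRTB", "DXS7423",                  # unlinked markers
]

# Marker pairs with significant LD, as described in the text.
LD_PAIRS = {
    "Sinhalese": [
        ("DXS10135", "DXS8378"),
        ("DXS7424", "DXS101"),
        ("DXS10074", "DXS10075"),
        ("DXS7132", "DXS10075"),
    ],
    "Sri Lankan Tamils": [],
    "Indian Tamils": [("DXS10074", "DXS10075")],
    "Moors": [("DXS10074", "DXS10075")],
}


def count_effective_loci(markers, ld_pairs):
    """Count loci after merging LD-connected markers (union-find)."""
    parent = {m: m for m in markers}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path compression
            m = parent[m]
        return m

    for a, b in ld_pairs:
        parent[find(a)] = find(b)
    return len({find(m) for m in markers})


for group, pairs in LD_PAIRS.items():
    print(group, count_effective_loci(MARKERS, pairs))
# Sinhalese 12, Sri Lankan Tamils 16, Indian Tamils 15, Moors 15
```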
These data suggest the existence of a complex pattern of LD among Asians. In contrast, a high level of LD was reported within these four clusters for European populations such as the Swedish 51 and German [52][53][54] populations, in studies conducted with 450-800 male individuals.
The genetic stability of the linked clusters and the degree of dependence between them can have a major influence on most forensic and kinship applications. Therefore, in addition to the pairwise LD analysis within separate clusters, pairwise LD was also analyzed between markers belonging to clusters III and IV, due to their relatively close physical proximity (6.78 cM) compared to the other clusters. However, LD was not detected between any of the marker pairs (P > 0.05 for eight out of nine marker pairs), as shown in Table 14, confirming the independent nature of the two clusters irrespective of their physical location.
The haplotype frequencies obtained for the four ethnic groups for all four clusters are listed in Supplementary Tables S7-S10. In most of the previously published population studies, cluster IV comprises only DXS7424-DXS101 without DXS7133 [54][55][56] . Likewise, the four markers included in cluster II have been analyzed in two different combinations, i.e. DXS7132-DXS10079-DXS10074 (Argus X-12 kit) and DXS10079-DXS10074-DXS10075 53 . Therefore, to allow comparisons with these previous studies, the haplotype frequencies of the above combinations are also listed in Supplementary Tables S11-S14. Among the four clusters, clusters I and II proved relatively more informative for all four ethnicities, as reflected by the haplotype diversity (HD = 0.9964-0.9987 and 0.9935-0.9973 respectively). In cluster I, this observation may have been caused by the two highly polymorphic markers, DXS10135 and DXS10148, as described above, and is consistent with other previously published population data 17,19,24 . Cluster II, on the other hand, carries four markers, whereas the other clusters consist of three markers each. This increased number of markers may have generated the higher haplotype diversity of cluster II among the studied populations. Clusters III and IV, in contrast, are equally informative (HD = 0.9873-0.9908 and 0.9882-0.9923 respectively), though not to the same extent as clusters I and II. Nevertheless, in general, all four clusters showed a high haplotype diversity for all four ethnicities.
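Haplotype diversity values such as those above are commonly computed with Nei's unbiased estimator, HD = n/(n - 1) × (1 - Σp²), where n is the number of sampled haplotypes and p their observed frequencies. The text does not state which estimator was used, so the sketch below is an assumption, and the example haplotypes are hypothetical. Haplotypes are represented as tuples of alleles, as could be read directly from hemizygous male profiles.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's unbiased haplotype diversity: n/(n-1) * (1 - sum(p_i^2))."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two haplotypes are required")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Hypothetical three-marker haplotypes (illustration only, not study data).
sample = [
    ("12", "29", "15"),
    ("12", "29", "15"),
    ("13", "30", "15"),
    ("14", "28", "16"),
]
print(haplotype_diversity(sample))  # about 0.8333
```

The estimator reaches 1 only when every sampled haplotype is distinct, so values above 0.98, as reported for all four clusters, mean that repeated haplotypes are rare in the sample.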

Conclusion
In this work, we report for the first time X chromosome based population genetic data for all four major ethnic groups in Sri Lanka, which together cover 99.5% of the total population. According to our results, the 16 X-STR assay system used in the current study is highly polymorphic and exhibited high forensic efficiency for all four ethnicities tested, indicating its suitability for both evolutionary genetic analysis and forensic applications among Sri Lankans. The present study has also revealed subtle but statistically significant differences in the X-STR based allele frequency distribution of Indian Tamils compared with Sinhalese and Moors, contrary to the highly homogeneous genetic picture portrayed by autosomal STR analysis. While suggesting a sex-biased demographic history for Sri Lankan ethnicities, this observation supports the use of a separate X-STR allele frequency database for Indian Tamils for forensic and kinship applications. In contrast, the genetic admixture observed among the Sinhalese, Sri Lankan Tamil and Moor ethnicities suggests the possible use of a common allele database for this purpose. Further, the genetic distances observed between the Sri Lankans and other world populations, visualized in the MDS plot, lend support to the ancient linguistic origins of the Sri Lankan ethnicities (the Indo-Aryan origin of Sinhalese and the Dravidian origin of Tamil populations), which were later affected to various degrees by genetic admixture between them. LD was detected along the X chromosome in all ethnic groups except Sri Lankan Tamils, which needs to be considered in the likelihood calculations for kinship resolution and person identification. Further, the observed patterns of LD, which differ substantially among the four ethnicities, call for the use of separate haplotype frequency databases for the four ethnicities for forensic purposes.
Although the results of haplotype analysis suggest that the four studied X-STR clusters can provide a powerful tool for kinship testing and relationship identification of Sri Lankan ethnicities, a more exhaustive sampling of the two Tamil groups and Moors is recommended to confirm the LD and haplotype based observations.

Sample preparation and DNA extraction
The study was conducted with the approval of the Ethics Review Committee, Institute of Biology, Sri Lanka (ERC IOBSL 135 11 15) and was performed in line with the principles of the Declaration of Helsinki. Written informed consent was obtained from all individual participants included in the study. Finger-prick blood samples were collected from 838 unrelated individuals from the four ethnic groups in the Sri Lankan population: 426 samples from Sinhalese (60.6% males), 154 samples from Sri Lankan Tamils (50% males), 128 samples from Indian Tamils (49.2% males), and 130 samples from Sri Lankan Moors (51.5% males). Genomic DNA was extracted using