The full landscape of Y-chromosomal diversity reveals complex population migration and admixture tracts
The non-recombining regions of the human Y chromosome are inherited strictly through the male line and can record many details of male behavior, phenotype and human demographic history [4, 7]. To explore the patterns of Y-chromosomal diversity, we reported the genotypes of 1033 Y chromosomes randomly sampled from 33 Chinese populations belonging to six ethnic groups (Mongolian, Manchu, Hui, Gelao, Li and Han), genotyped using our newly developed 639-plex Y-SNP panel and a high-density Affymetrix array. We conducted comprehensive population evolutionary analyses and population comparison tests within and between Chinese populations from different geographical regions or language families. This population genetic survey suggested that, among the forensic Y-SNP genotyping tools focused on Chinese populations, our panel captured the richest Y-chromosomal genetic diversity to date [26, 27, 32, 49–53]. Phylogenies constructed for the non-Han Chinese groups (Mongolic, Hui and Tai-Kadai) and for all included subjects consistently showed strong geography- or ethnicity-related Y-chromosomal features, which point to a complex underlying population evolutionary history and to the panel's potential for forensic pedigree search and biogeographical ancestry inference.
HFS analysis of our studied populations, and of the regional northern Mongolian and southern island Li people, revealed their common founding lineages, C2a/2b and O1a/1b (Fig. 1A, B). The geographical distribution further confirmed that these dominant lineages could be used as forensic markers for genetic localization. C2a1-F3914 can serve as a predominantly Mongolian founding lineage; it was observed in 42 Mongolian and 26 Han Chinese individuals, mainly from Shaanxi, Shanxi and Inner Mongolia. Another predominantly Mongolian lineage, C2b1-F2613, was observed in 16 Mongolian (16/159), two Manchu, one Li, two Hui, 32 Han (32/693) and six Gelao individuals. The upstream C2b-F1067 (previously classified as C2c) was first reported in northeastern Asia and is associated with the origin and expansion of Mongolic-speaking populations. Subclades of C2b1a1a1a-M407 (C2c1a1a1 in the previous version) appeared in ten individuals (including two Han, one Hui, one Li and four Mongolian individuals), and C2b1b1b-F5477 and C2b1a2b2-FGC45548 were observed seven and six times, respectively. Huang et al. presented a revised phylogenetic tree and distribution map covering all available C2b1a1a1a-M407 samples and found that this lineage has a frequency of over 50% in the northeastern Asian population [54]. Thus, C2b1-F2613 and C2b1a1a1a-M407 can be used to trace the population origin and migration of Kalmyks, Mongolians, Buryats and other genetically close northeastern Asians.
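The carrier counts quoted for C2b1-F2613 (16/159 Mongolians and 32/693 Han) can be turned into lineage frequencies with a rough measure of sampling uncertainty. The following is a minimal sketch, not the analysis pipeline used in this study: the function names `wilson_interval` and `summarise` are hypothetical, and the Wilson score interval is our own choice of interval for illustration.

```python
from math import sqrt

# Carrier counts for lineage C2b1-F2613 as quoted in the text:
# population -> (carriers, total sampled individuals).
COUNTS = {
    "Mongolian": (16, 159),
    "Han": (32, 693),
}


def wilson_interval(carriers, total, z=1.96):
    """Return the (lower, upper) Wilson score interval for a proportion.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    p = carriers / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def summarise(counts):
    """Compute the lineage frequency and Wilson interval per population."""
    rows = {}
    for pop, (carriers, total) in counts.items():
        low, high = wilson_interval(carriers, total)
        rows[pop] = {"freq": carriers / total, "low": low, "high": high}
    return rows


if __name__ == "__main__":
    for pop, r in summarise(COUNTS).items():
        print(f"{pop}: {r['freq']:.3f} (95% CI {r['low']:.3f}-{r['high']:.3f})")
```

The point estimate is higher in Mongolians than in Han, but the interval for the smaller Mongolian sample is wide, so such comparisons are best read together with the sample sizes.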
Network and phylogeny reconstructions further showed that O1a-M119 was the common paternal lineage in southern Chinese populations (Fig. 2). The sublineages O1b1a1a1a1b2a1a1-F2517 (23), O1b1a2a1-F1759 (10), O1a1a1b-Z23420 (15) and O1a1a2a1a-Z23266 (11) had undergone recent population expansion. All F2517 lineages were observed in Li people, consistent with a recent whole-genome sequencing study [55]. Chen et al. found that O1b1a1a was the dominant lineage in southern East Asians, having diverged from others 10998 years ago, and that the F2517 sublineages further split into the O1b1a1a1a1b2a1a1a and O1b1a1a1a1b2a1a1b clades 2828 years ago [55]. Sun et al. also found that most sublineages of M119, including the F1759, Z23420 and Z23266 identified here, contributed to the ancient gene pool of modern Tai-Kadai, Austronesian and southern Han Chinese populations [10]. We also identified shared paternal lineages among Li, southern Han and Gelao people in the distribution of M119 mutations. The unique paternal genetic structure of the Hainan Li people was also consistent with our previous findings on their fine-scale genetic structure [56].
Our survey identified 195 samples of O2a1-CTS7638 and 372 samples of O2a2-P201 in Han Chinese populations. The sublineages O2a2b1a1a1a1a1a1a1-M6539 (71), O2a1b1a1a1a1a1a1-F17 (43), O2a2b1a1a1a1a1b1a1b-MF15397 (26), O2a2b2a1b1-A16609 (24) and O2a2b1a1a1a1a1a1-F155 (23) have experienced recent population expansion events, consistent with the population expansion among Han Chinese populations inferred from the whole-genome sequencing project. The admixture-introduced rare lineage Q1a1-F746/F790 was observed in 30 samples from Mongolian, Han and Gelao people, and the steppe pastoralist-related R1a1a1b2-Z93 was observed in 17 Mongolian and Hui individuals. Our results provided genetic evidence for extensive admixture between northern East Asians and surrounding populations. Similar patterns emerged from the paternal genetic history reconstruction by Wang et al., who concluded that multiple ancestral sources contributed to the formation of the paternal gene pool of Mongolian people [32]. In addition, He et al. found that Mongolic-speaking populations are strongly stratified: the northern group was influenced by Siberian ancestry, the western group by western Eurasian ancestry and the southern group by the Han Chinese expansion [37]. Recent ancient DNA studies also found that western Eurasian steppe ancestry has influenced the genetic makeup of northern East Asians. For the western Eurasian ancestry identified in Hui people, our paternal results were also consistent with the admixture patterns inferred from genome-wide SNP data. Complex demographic models suggested that geographically diverse Chinese Hui people experienced complex and differing admixture processes and carried approximately 10% ancestry related to Central Asians [57–59].
We also identified paternal genetic structure among Chinese populations in the clustering patterns obtained from HFS, PCA and MDS. These population stratifications were in accordance with language or geographical categories. Island Li people shared a specific paternal genetic structure and clustered far from other Chinese populations. Similarly, Mongolian people clustered together and had a relatively close genetic relationship with northern East Asians. We should also note that the Y-chromosome-based population structure in China is coarser than that inferred from genome-wide SNPs. Our recent genetic studies have identified five population substructures correlated with languages and geography in China. Mongolic and Tungusic people in northeastern China harbored the highest Ulchi or ancient Boisman/DevilsCave-related ancestry [37, 41]. Tibeto-Burman groups from the Tibetan Plateau had the highest proportion of ancestry related to core Tibetans from Tibet and to the Chokhopani, Mebrak and Samdzong people from Nepal [60, 61]. The primary ancestral component of Hmong-Mien people from southwestern China was maximized in ancestry related to Miao and Yao people [62], and Austronesian-speaking people from Taiwan Island had more Ami/Atayal or ancient Hanben-related ancestry [41]. Han Chinese ancestry was localized between the four ancestries mentioned above and showed a north-to-south genetic cline. Recent large-scale genetic studies also identified fine-scale population structure among geographically diverse Han Chinese populations [46, 47, 63]. These fine-scale genetic backgrounds could promote better study design for large clinical cohorts and forensic genetic localization in criminal cases.
The 639-plex Y-SNP panel can be used as a powerful forensic tool for Chinese forensic pedigree search and biogeographic ancestry inference
The forensic community has noticed that whole-genome sequencing in forensic casework must overcome demands for specialist expertise, sequencing platforms and genomic statistical methods, as well as for experimental methods suited to forensic case samples. In addition, the per-sample cost is another important obstacle to the wide application of whole-genome sequencing in forensics. Evolutionary geneticists have conducted many vital projects to explore the complete anthropologically informed phylogeny [7, 19, 20]. Our work has identified most paternal founding lineages in Chinese populations and comprehensively characterized their geographical and ethnic distribution. Our panel provides high coverage of the genetic variation in terminal lineages and complete coverage of reference data from the main populations or ethnic groups in China.
Forensic pedigree search can help trace possible crime suspects based on shared Y-chromosome mutations. Lineage-informative Y-SNPs are usually used together with Y-STR markers. Many prior works provided relatively high-resolution forensic phylogenetic trees and presented the corresponding scientific examination and analysis strategies. They advanced forensic Y-chromosome applications in pedigree search and biogeographic ancestry inference based on customized SNaPshot and NGS technologies [26, 27, 32, 49–53]. Song et al. explored the paternal genetic structure of Hainan Li using their panel of 141 Y-SNPs and found that haplogroup O1b1a1a1a1a1b-CTS5854 can be used as an ethnicity-specific lineage in population and forensic genetics [51]. Song et al. further updated their panel to include 233 Y-SNPs, applied it to Chinese Qiang people, and found that O2a2b1a1-M117, O2a2b1a1a1-F42 and O2a1b1a1a1a-F11 were the founding lineages in Qiang people [52]. Wang et al. also investigated the paternal genetic structure of Zhuang people using this panel and identified the O2-dominant lineages in Tai-Kadai people [49]. Xie et al. developed a panel of 157 Y-SNPs focused on Hui people and identified the population substructure of Hui people [26]. Wang et al. focused on the genetic diversity of Mongolian people (N1b-F2930, N1a1a1a1a3-B197, Q-M242 and O2a2b1a1a1a4a-CTS4658) and developed a Mongolian-specific panel including 215 Y-SNPs [32]. The panels mentioned above consisted of several customized SNaPshot systems, which limited their rapid use in forensic cases. Wang et al. developed a 165-plex Y-SNP panel based on an Ion S5 XL system and comprehensively evaluated its sequencing performance, concordance, reliability, sensitivity and stability following the ISFG guidelines [27]. Liu et al. recently updated this system by increasing the final Y-SNP number to 256 [53]. Significantly, Tao et al. developed a customized SifaMPS 381 Y-SNP panel that included 381 Y-SNPs focused on Chinese populations and investigated the basic structure and subbranches of the major Chinese haplogroup branches [50]. Our panel has two significant features. First, it includes lineages specific to or common in most Chinese populations (O, D, C, R and Q, among others). Second, it retains a higher resolution of the terminal Y-chromosome lineages, which addresses the shortcoming of previously developed panels that were limited to some common lineages or focused only on specific populations. The newly developed panel overcame the limitations in lineage representativeness, terminal lineage resolution and sequencing platforms, and can therefore provide a best-practice tool for forensic applications. The identified paternal population structure in China can provide more clues for biogeographical ancestry inference and can be used as a complementary tool for forensic ancestry prediction based on the autosomal ancestry-informative SNP panel [57].