Genomic scans for selective sweeps with four complementary statistical 1 tests on 14 indigenous sheep breeds from Middle East and South Asia 2

The performance and productivity of livestock have consistently improved by natural and artificial selection over the centuries. Both these selections are expected to leave patterns on the genome and lead to changes in 29 allele frequencies, but natural selection has played the major role among indigenous populations. Detecting 30 selective sweeps in livestock may assist in understanding the processes involved in domestication, genome 31 evolution and discovery of genomic regions associated with economically important traits. We investigated 32 selection signals in this study using SNP genotype data of 14 indigenous sheep breeds from Middle East and 33 South Asia, including six breeds from Iran, namely Iranian Balochi, Afshari, Moghani, Qezel, Zel, and Lori- 34 Bakhtiari, three breeds from Afghanistan, namely Afghan Balochi, Arabi, and Gadik, three breeds from India, 35 namely Indian Garole, Changthangi, and Deccani, and two breeds from Bangladesh, namely Bangladeshi 36 Garole and Bangladesh East. The SNP genotype data were generated by the Illumina OvineSNP50 37 Genotyping BeadChip array. We applied four complementary statistical tests, F ST (fixation index), xp-EHH 38 (cross-population extended haplotype homozygosity), Rsb (extended haplotype homozygosity between- 39 populations), and FLK (the extension of the Lewontin and Krakauer) to detect selective sweeps. Our results 40 not only confirm the previous studies but also provide a suite of novel candidate genes involved in different 41 traits in sheep. On average, F ST , xp-EHH, Rsb, and FLK detected 128, 207, 222, and 252 genomic regions as 42 candidates for selective sweeps, respectively. Furthermore, overlapping candidate genes were detected by 43 these four tests, including; TNIK, DOCK1, SFMBT1, SPO11, USH2A, TYW1B, NELL2, EML5, and IQCE. The first six of these genes, especially TNIK, DOCK1, and SPO11 were enriched by Gene Ontology (GO) 45 and are involved in embryonic development, immune response, and fertility respectively. Knowledge of candidate genomic regions in sheep populations may facilitate the identification and potential exploitation 47 of the underlying genes in sheep breeding.


Introduction
Bangladesh East (BGE) (Sempéré et al. 2015). Information on these 14 breeds is summarized in Table 1

103
The genotype data from different breeds were merged using PLINK (Purcell et al. 2007). We excluded the 104 SNPs located on sex chromosomes and those with unknown chromosomal position. The quality control was 105 performed using PLINK (Purcell et al. 2007). SNPs that were genotyped in less than 90% of the animals, 106 had a minor allele frequency (MAF) lower than 1%, or departed from Hardy-Weinberg proportions at a P-107 value < 10 -3 were discarded. Furthermore, individuals with more than 10% missing genotypes were removed 108 from the data set. After quality control, we used Beagle V5.0 software to impute sporadic missing genotypes 109 (Browning & Browning 2007). The fcGENE V1.7 software was used to convert the PLINK formatted files 110 to Beagle format and vice versa (Roshyara 2014). Genetic diversity and population structure Individual genetic distances for the 14 sheep breeds were represented by a neighbor-joining tree and 116 displayed using VCF-kit V0.1.6 (Cook & Andersen 2017) and FigTree.v1.4.4 (Rambaut 2015).

117
We performed a principal component analysis (PCA) to investigate the population structure and to check 118 whether samples for a breed came from a homogeneous population. PCA was done for the 14 sheep breeds 119 using the smartpca program, which is part of EIGENSOFT 7.2.1 (Price et al. 2006).

120
Linkage disequilibrium (mean of r 2 ) among SNPs was estimated for the breeds using PopLDdecay V1.01 121 software, and a Perl script was applied to visualize the results (Zhang et al. 2019).

124
We analyzed ancestry using ADMIXTURE v1.3.0 to infer breed origins and quantify the populations' 125 admixture (Johnston et al. 2015). For a priori defined ancestry component (K), individual ancestry 126 proportions were calculated with ADMIXTURE v1.3.0, which was an assumption of the number of ancestral 127 populations (Alexander et al. 2009). Using 14-fold cross-validation for K values ranging from 2 to 14, 128 admixture analysis was performed. To identify the most likely number of ancestral populations, the lowest 129 14-fold cross-validation error was applied. Finally, the admixture graphs were visualized using the R package 130 BITE (Milanesi et al. 2017).  (Table 1). Therefore, we compared pairwise these three categories increasing differentiation using VCFtools V0.1.15 (Danecek et al. 2011). For each comparison, the mean of 143 FST value was computed in all 39348 SNPs. Z transformation of the mean of FST values (Z(FST)) was 144 performed using the "scale" command in R software.

146
FLK test is an extension of the original Lewontin and Krakauer (LK) statistic (Bonhomme et al. 2010).

147
It calculates a population differentiation statistic, which includes a kinship matrix representing the 148 relationship between populations (Weigand & Leese 2018).

149
This test accounts for population structure and differences in the effective population size by modeling the 150 genetic divergence between populations as a result of drift and population division (Bertolini et al. 2018).

151
For FLK analyses, p-values for a) IR vs IN, b) IR vs AF, and c) IN vs AF were computed as explained in the 152 hapFLK software documentation (Fariello et al. 2012). For each comparison, the negative log p-value was 153 calculated using the hapFLK R script (Fariello et al. 2012), and the candidate genomic regions under 154 selection were plotted.

156
Haplotype-based procedures to study genome-wide patterns of divergence studies, in comparison with SNP-157 based approaches, have an advantage of avoiding ascertainment bias towards common variants in SNP array 158 design (Browning & Weir 2010;Randhawa et al. 2014).
extended homozygosity based haplotype with other haplotypes at the selected locus (Sabeti et al. 2007).

161
Complete selective sweeps can be approached by using the cross-population EHH (xp-EHH) test, which 162 compares each population regarding corresponding haplotypes to the other populations. The xp-EHH test 163 compares the integrated EHH profiles between two populations at the same SNP (Sabeti et al. 2007). The 164 xp-EHH test has a high power to detect selection signatures in small sample sizes, and therefore grouping of 165 genetically similar breeds may help in gaining power (Sabeti et al. 2007;Pickrell et al. 2009).

166
Rsb test to identify selective sweeps is based on the same idea of estimation of EHH as xp-EHH test. However 167 in contrast to xp-EHH test, it does not require phasing information (Weigand & Leese 2018). Rsb involved 168 the comparison of the EHH patterns of the same allele (referred to as 'iES') between two populations when 169 EHH is compared between alleles at one SNP within a population (Tang et al. 2007).

170
In this study, for a) IR vs IN, b) IR vs AF, and c) IN vs AF, we used the xp-EHH and Rsb approaches (Sabeti 171 et al. 2006;Tang et al. 2007) to determine selected alleles with higher frequency than expected according to 172 their haplotype length to obtain recent and generally segregating selective sweeps. The haplotypes were 173 phased with Beagle (Browning & Browning 2007), and then xp-EHH and Rsb scores were calculated for 174 each haplotype within a population. Haplotype frequencies were computed for 39348 SNPs. For each locus, 175 the xp-EHH and Rsb score were calculated using the rehh package (Gautier & Vitalis 2012) in R and the 176 candidate genomic regions under selection were obtained.

177
For each test, the genes that were considered as candidates were found within the intervals spanning the 178 candidate genome regions and also overlapping candidate genes among the tests were captured using the 179 Ovis Oar_v4 reference genome assembly in the Ensembl (Zerbino et al. 2017

183
The biological enrichment and functional annotation of the genes under selective pressure were defined using 184 Gene Ontology Consortium (http://geneontology.org).

187
Populations and Genotype Data

188
After quality control and imputation of missing genotypes from 463 individuals genotype data for 39531

292
The FST and FLK tests with average 128 and 252 genes showed the minimum and maximum captured genes 293 among these four tests. Furthermore, five, six and three concordant genomic regions for a) IR and IN breeds,

303
The overlapping candidate genes for b) IR and AF breeds include; Echinoderm microtubule-associated 304 protein-like 5 (EML5) on chromosome seven, may change the assembly dynamics of microtubules to make 305 microtubules are slightly longer but more dynamic and it is possible that Eml5 plays a role during neuronal

316
We also detected overlapping candidate genes for IR Vs IN, IR Vs AF, and IN Vs AF data on the FST, xp-317 EHH, Rsb, and FLK tests (Figure 9). For the FST test PPA2, involved in the immune system, and KCNIP4 318 plays important role in heart performance and it is related to skeletal muscle growth and also immune 319 response. SYT1, associated with feeding behavior traits such as residual feed intake and TMEFF2, involved 320 in a wide range of traits such as; immune response, milk production and sperm morphology, were detected 321 as overlapping candidate genes for the Rsb test. For the FLK test, PPA2, EML5 genes, which have been 322 found in the previous tests, MGAT5, associate with dry matter intake and NEB, involved in environment 323 adaptation, were detected as overlapping candidate genes on the three different data. We did not find any        (Kullo & Ding 2007). The older selection events between populations are 420 expected to be identified by FST (Ma et al. 2014;Maiorano et al. 2018). The xp-EHH test is an extension of 421 EHH (Sabeti et al. 2007), that incorporates information on the relationship between an allele's frequency and low ascertainment bias sensitivity (Tang et al. 2007). The Rsb test is population comparison test to identify 424 selective sweeps (Tang et al. 2007). The test is based on the same idea as the XP-EHH, identifies loci similar 425 to the XP-EHH test under selection, but can be implemented with unphased data (Weigand & Leese 2018).   Table S1, S2, and S3). This indicates 458 that the AF breeds, which are from a hot dry climate, are more adapted to heat tolerance.

459
Many of the genes identified as candidate genes in this study are effective in genetic resistance to disease 460 and immune response. Since genetic resistance against diseases and harsh environmental conditions is one 461 of the important characteristics of indigenous animal breeds, the identification of a large number of genes in 462 this study indicates that the associated genes have been under selection pressure over time due to the natural 463 selection of immune response traits (Scarpa et al. 2003;Mwai et al. 2015). For example, we detected the proposed to be candidate genes for milk and fat production in cattle (Nafikov et al. 2013;Chen et al. 2018).

477
All of these genes were found in b (IR and AF), and c (IN and AF) clusters, indicating AF breeds may be 478 under selection pressure related to milk traits but it needs further research to conclude. (Supplementary Table   479 S1, S2, and S3).

480
We found several candidate genes involved in body weight and growth traits specially post-weaning gain in 481 all population clusters, such as the TRHD, UBR2, GRM2, GRM3 (Zhang et al. 2013;Gebreselassie et al. 482 2020).

483
The TBC1D10B, and BMPR1B are two examples of candidate genes involved in fertility traits which we 484 detected in this study and are consistent with previous studies on sheep and cattle (Höglund et al. 2015;485 Gebreselassie et al. 2020).

497
Two of the genes (DOCK1 and TNIK) play important roles specially in resistance against diseases. DOCK1 498 located on chromosome 22 is involved in immune response (Laurin et al. 2008;Kunimura et al. 2020). TNIK

507
In total, seven unique candidate genes were detected for IR vs. IN, IR vs. AF, and IN vs. AF comparisons by 508 FST, Rsb, and FLK analysis, but no overlapping candidate gene was found for the xp-EHH method ( Figure   509 9). PPA2 on chromosome 6 is associated with immune response and disease resistance in cattle (Brym & 511 Kamiński 2017). KCNIP4 gene on chromosome 6 is directly involved in processes related to muscle growth 512 and fat deposit in sheep (Pasandideh et al. 2018) and was reported in cattle involved in bovine growth and 513 calcium metabolism (Smith et al. 2019). SYT1 gene on chromosome 3 is associated with feeding behavior 514 traits (Pattaro et al. 2010), and TMEFF2 gene on chromosome 2 is involved in a wide range of traits such as; 515 immune response, milk production and, sperm morphology ( Richardson et al. 2016;Zhang et al. 2020). For 516 the FLK test, PPA2, EML5, MGAT5, and NEB genes were detected, which PPA2 and EML5 have been 517 found in the previous tests ( Figure 9). MGAT5 gene on chromosome 2 is associate with dry matter intake in 518 cattle (Seabury et al. 2017), and NEB gene on chromosome 2 is involved in environmental adaptation.

519
Among 1262 selected genomic regions reported by Yudin & Larkin (2019), only NEB gene was a shared 520 candidate gene among cattle, sheep, mammoth, polar bear, and whale genomes.

521
GO classifications of the candidate genes were performed to enable a better understanding of their molecular 522 functions. Based on the GO biological process (BP) for a significant threshold (p ≤ 0.05), we implemented 523 the GO on 11 overlapping candidate genes. Only six genes (TNIK, DOCK1, SFMBT1, SPO11, USH2A, and 524 TYW1B) associated with the 26 GO terms were identified. In total 11 GOs were related to TNIK and 525 DOCK1, which are associated with local adaptation (resistance against diseases). The first GO 526 (GO:0007010) included TNIK and DOCK1 genes associated with cytoskeleton organization, and six other

555
Any request for data should be addressed to the corresponding author.