Animals and genotypes
A total of 373 animals from nine breeds were chosen to study which including two indicine (Brahman = 15), (Nellore = 26), seven taurine (Angus = 27), (Holstein = 76), (Hereford = 25), (Hanwoo = 66), (Jeju Black = 78) and (Brown Wagyu = 10 and Black Wagyu = 50). Among the animals, Hanwoo DNA was extracted either from commercial AI bull semen straws or tail hair samples obtained from different farmers with the permission of the owners and Jeju Black cattle DNA was prepared either from AI bull semen straw or tail hair in Jeju Island, Korea and Holstein DNA was also collected from semen straw provided by Nunghyup Dairy cattle improvement center. All other cattle DNA was provided by Texas A & M University, USA except Japanese Wagyu breed. No ethics statement was required for the collection of DNA samples from the Brahman, Nellore, Angus and Hereford cattle because DNA samples were provided by the two authors in USA under their rules and regulations. Japanese Wagyu breeds data was provided by the author in Japan which are previously used and published by other study. Genomic DNA purification and Genotyping was accomplished by DNALink, a commercial genome analysis service provider in Korea.
SNP genotyping and assembly of data sets
All of the breeds except Wagyu were genotyped using SNP chip by DNALink which includes common SNP from Bovine 50K v.3 BeadChip (Illumina, San Diego, CA, USA). Japanese Wagyu cattle breeds were also genotyped using Illumina bovine 50K SNP chip and data was provided by Kobe University, Japan. All the genotyped data was then merged and common SNPs on autosomal chromosomes were selected which resulted 45,526 SNPs in the final dataset across the breed.
Quality control and filtering of SNPs
SNP quality control and filtering based on recorded SNP genotypes were performed across nine cattle breeds to remove SNPs from further analysis if any SNPs with less than 95% call rate, and animals (sample) with less than 95% call rate. This left about 41,186 SNPs in 372 animals across the breeds according to Plink software program [35]. One Holstein breed failed to pass the QC criteria. SNP quality filtering was performed using PLINK version 1.9 and SNPs that were in high LD were pruned using the following parameter, -indep pair wise command 50 5 2 (SNP window size: 50, SNPs shifted per step: 5, r2 threshold: 2). Pruning of SNPs that are in high LD have been shown to counter the effect of ascertainments bias and to generate meaningful comparison between breeds [36]. After pruning, a total of 18,524 SNPs was remained for analysis.
Estimates of within breed genetic diversity
Genetic diversity within cattle breeds can be estimated by three measures such as expected heterozygocity (HE), observed heteorozygocity (HO) and inbreeding coefficient (FIS). All of these parameter was calculated using R software package divRsity v1.9.9 [37].
Analysis of molecular variance (AMOVA) and population differentiation
Analysis of molecular variance to determine the partition of genetic diversity was performed in cattle breeds. The genetic variability among nine cattle breeds was estimated using the program ARLEQUIN v3.5 [38]. Population differentiation was calculated by pairwise FST estimates according to Weir and Cockerham’s, 1984 [39] approach using R package divRsity v1.9.9.
Population structure Analysis
Population structure of the studied cattle breed was also carried out using the software ADMIXTURE v1.3. [40]. It is a very useful and popular tool to analyze SNP data. It performs an unsupervised clustering of large numbers of samples, and allows each individual cattle breed to be a mixture of clusters. ADMIXURE uses a model-based estimation of individual ancestry for a range of prior values of K defined by the user. In order to elicit the true number of genetic populations (clusters or K) between nine cattle breeds, a cross validation (CV) approach was used to determine the most likely number of populations (K) in the SNP data. The best possible number of ancestral population (K) was inferred through 3 to 11 pre assumed populations. For each tested value of K, ADMIXURE estimates the proportion of each individual’s genotype derived from each cluster. A preferable value of K will show low cross validation error compared with other K values.
Principal component analysis (PCA)
Principal component analysis (PCA) was carried out to infer the relationship among the nine cattle populations by using PLINK v1.9. PCA determine breed relationships based directly on allele frequencies by using a multivariate method, which condenses the information from a large number of alleles and loci into a few synthetic variables known as principal components [34].
Genetic distance
Phylogenic analysis on the basis of SNP data has become an important tool for studying the evolutionary history of any organism. Many statistical methods and software developed for these purposes. In this study, we construct phylogenic tree by calculating Nei’s Genetic distances (DA) [16] using Poptree 2 [41] program. Neighbor-Joining (NJ) method applied for measuring Nei’s genetic distance, DA which is defined by
where xij and yij are the frequencies of the i-th allele at the j-th locus in populations X and Y, respectively, mj is the number of alleles at the j-th locus, and r is the number of loci used.
Linkage disequilibrium (LD)
Various statistical approach were practiced to measure the extent of LD. Among them, Lewontin’s D′ [42] and Hill’s r2 [43] were widely used to measure the extent of LD, although their functions are different. But the range of both measures is between 0 and 1. LD estimator, r2 is preferred for association studies for its robustness, simplicity and not sensitivity to changing gene frequency and effective population size. On the other hand, estimates of D′ are sensitive to allele frequency, especially when one allele is rare, and is inflated for small sample sizes [44]. Therefore, we focused on r2 measure for further characterization of LD. The r2 estimator represents the squared correlation coefficient (r) between two variables (alleles) at two separate SNP marker loci [45] and r2 = 1 when only two haplotypes are present, which is usually a consequence of genetic drift or population bottlenecks. PLINK software [35] was used for estimation of r2 parameter for post quality control SNPs with known genomic location of all autosomal chromosomes. Here, r2 can be expressed by the following equation:
Where, Pab represents the frequency of haplotypes consisting of allele a at the first locus and allele b at the second locus [17]. Pairwise r2 values were calculated for each chromosome in each breed, as well as across breed, with the - r2 command, using the default settings. In this study, we used PLINK ‘— ld-window and — ld-window-r2 commands’ values, so that correlations between all possible SNPs were tested and that there was no r2 threshold, in order to encapsulate all possible linkage interaction between SNPs per chromosome. To display the decay of LD, distances of pair-wise SNPs were binned into twenty types of intervals (0 to 1 kb, 10 kb intervals starting from 1 up to 100 kb and 100 kb interval starting from 100 up to 1 Mb). For each chromosome, r2 values were then sorted by inter-SNP distance, and averaged across the afore-mentioned intervals to observe possible r2 patterns for increasing inter-SNP distances.
Effective population size (NE)
Effective population size (NE) was estimated using the software tool, SNeP v1.1 by Barbato [46]. SNeP estimates NE from linkage disequilibrium (LD) data, using the following formula suggested by Corbin [47],