1.1. Description of the dataset and markers used in the study and quality control
For the current investigation, we used genotype data on 187 animals from eight distinct cattle breeds. The 50K SNP chip data for Tharparkar (n = 72) were generated in our laboratory; data for the remaining breeds were obtained from the WIDDE online portal (Web-Interfaced next-generation Database for genetic Diversity Exploration). The analysis includes the indigenous breeds Gir (n = 24), Sahiwal (n = 17), Red Sindhi (n = 10), Kankrej (n = 10), Ongole (n = 20), Hariana (n = 10), and Nelore (n = 24). Table 1 summarises the dataset used in our investigation. Only autosomal SNPs were taken into account. Quality control was performed with PLINK v1.90 (Purcell et al., 2007): SNPs with a call rate below 0.90 or a minor allele frequency (MAF) below 0.05 were discarded, as were SNPs on the X and Y chromosomes and SNPs lacking a physical or chromosomal position.
Table 1
Sample description for eight different cattle breeds used in the study
| Sr. No | Name | Abbreviation | Size | Origin | SNPs before QC | SNPs after QC | Purpose |
|---|---|---|---|---|---|---|---|
| 1 | Gir | GIR | 24 | Indigenous | 51998 | 22452 | Dairy |
| 2 | Hariana | HAR | 10 | Indigenous | 52497 | 15606 | Dual |
| 3 | Kankrej | KAN | 10 | Indigenous | 52497 | 21059 | Draught |
| 4 | Nelore | NEL | 24 | Indigenous | 51998 | 19458 | Draught |
| 5 | Ongole | ONG | 20 | Indigenous | 52497 | 27474 | Dual |
| 6 | Red Sindhi | RSI | 10 | Indigenous | 52497 | 23118 | Dairy |
| 7 | Sahiwal | SAH | 17 | Indigenous | 52497 | 20154 | Dairy |
| 8 | Tharparkar | THAR | 72 | Indigenous | 47515 | 21943 | Dairy |
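The QC filters described above can be sketched as follows. This is a minimal illustrative implementation of the two marker-level thresholds (call rate ≥ 0.90, MAF ≥ 0.05), not the PLINK code itself; the 0/1/2 dosage coding and the function name are assumptions for illustration.

```python
def snp_qc(genotypes, min_call_rate=0.90, min_maf=0.05):
    """Filter SNPs (columns) by call rate and minor allele frequency.

    genotypes: list of per-animal genotype lists coded 0/1/2 (alternate-
    allele dosage) with None for missing calls. Returns the indices of
    SNPs that pass both thresholds. The coding scheme is an assumption
    for illustration; PLINK applies the same logic via --geno and --maf.
    """
    n_animals = len(genotypes)
    n_snps = len(genotypes[0])
    keep = []
    for j in range(n_snps):
        calls = [row[j] for row in genotypes if row[j] is not None]
        # call rate: fraction of animals with a non-missing genotype
        if len(calls) / n_animals < min_call_rate:
            continue
        # alternate-allele frequency; MAF is the smaller of p and 1 - p
        p = sum(calls) / (2 * len(calls))
        if min(p, 1 - p) < min_maf:
            continue
        keep.append(j)
    return keep
```

In practice these filters were applied with PLINK; the sketch only makes the two exclusion rules explicit.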
1.2. Population stratification
In this work, principal component analysis (PCA, via eigendecomposition) is the primary statistical method for assessing population stratification. PCA was carried out on the dataset using PLINK and packages in R. With the --distance-matrix option, PLINK computes pairwise genetic distances and writes the resulting matrix to the text file dataForPCA.mdist. This matrix was then analysed in R by multidimensional scaling: the file was first loaded into R together with the breed names (fam variable) and individual IDs (famInd variable), after which multidimensional scaling and eigenvector computation were performed with the cmdscale() function, retaining the first five principal components. The PCA results were then combined with the breed and family ID data (eigenvec_populations), and the variance explained by each principal component was computed (eigen_percent). PCA was visualised with the tidyverse package (Fig. 14), combining base R with tidy coding techniques: the first and second principal components from the eigenvec_populations file were plotted, with the eigenvectors displayed along the axes of the graph.
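The cmdscale() step above is classical multidimensional scaling on the PLINK distance matrix. A minimal sketch of that computation, assuming the distances have already been loaded into a square array (the function name and return values mirror, but are not, the R interface):

```python
import numpy as np

def cmdscale(D, k=5):
    """Classical multidimensional scaling (an analogue of R's cmdscale()).

    D: n x n symmetric matrix of pairwise genetic distances, e.g. the
    PLINK --distance-matrix output loaded from dataForPCA.mdist.
    Returns (coords, pct_var): the first k principal coordinates per
    individual and the percentage of variance each axis explains.
    """
    n = D.shape[0]
    # double-centre the squared distance matrix
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # eigendecomposition, eigenvalues sorted in decreasing order
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    pos = vals[:k].clip(min=0)              # drop tiny negative eigenvalues
    coords = vecs[:, :k] * np.sqrt(pos)     # principal coordinates
    pct_var = 100 * pos / vals.clip(min=0).sum()
    return coords, pct_var
```

Plotting the first two columns of coords, coloured by breed, reproduces the kind of visualisation described in the text.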
1.3. Diversity estimation
Genetic diversity, ROH, inbreeding, linkage disequilibrium, effective population size, and haplotype block structure were analysed genome-wide in Tharparkar cattle from India. The study included a total of 72 Tharparkar animals that had previously been genotyped on the Illumina BovineSNP50 array. Several indices, including observed heterozygosity (HO), minor allele frequency (MAF), and inbreeding coefficients (F), were used to evaluate genetic diversity within the Tharparkar cattle population. HO was compared with the heterozygosity expected (HE) under Hardy-Weinberg equilibrium (HWE); both were estimated with PLINK's --hardy flag. MAFs were calculated with PLINK's --freq option and classified into six categories, (0–0.05), (0.05–0.1), (0.1–0.2), (0.2–0.3), (0.3–0.4), and (0.4–0.5), and the proportion of SNPs in each category was computed (Purcell et al., 2007). Inbreeding coefficients were calculated using four different approaches (FROH, FHOM, FUNI, and FGRM). For each animal, FROH was computed by dividing the total genome length covered by ROH (LROH) by the total autosomal genome length covered by SNPs (LAUTO) (McQuillan et al., 2008). Effective population size (Ne) was estimated with NeEstimator v2.1 and SNeP v1.1 from linkage disequilibrium and the recombination rate (c) between SNPs (Do et al., 2014; Barbato et al., 2015). Historical Ne for the Tharparkar population over past generations was computed from pairwise LD values in SNeP v1.1, with the bin width for SNPs set to 0.1 Mb and the recombination rate (c) adjusted according to Sved and Feldman (1973).
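The per-SNP diversity indices and the FROH definition above are simple to state explicitly. The sketch below is illustrative only, with assumed variable names; the actual estimates came from PLINK (--hardy, --freq) and the ROH analysis:

```python
def diversity_stats(geno_counts, l_roh, l_auto):
    """Per-SNP HO, HE, MAF plus the FROH inbreeding coefficient.

    geno_counts: (n_AA, n_AB, n_BB) genotype counts at one SNP.
    l_roh / l_auto: genome length in ROH vs. total autosomal length
    covered by SNPs, as in McQuillan et al. (2008). Variable names are
    illustrative, not taken from the PLINK output format.
    """
    n_aa, n_ab, n_bb = geno_counts
    n = n_aa + n_ab + n_bb
    ho = n_ab / n                      # observed heterozygosity
    p = (2 * n_aa + n_ab) / (2 * n)    # frequency of the A allele
    he = 2 * p * (1 - p)               # heterozygosity expected under HWE
    maf = min(p, 1 - p)                # minor allele frequency
    f_roh = l_roh / l_auto             # FROH = LROH / LAUTO
    return ho, he, maf, f_roh
```

Binning the returned MAF values into the six intervals listed above then gives the per-category SNP proportions.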
1.4. Candidate gene annotation, GO, and QTL analysis
After quality control, the remaining SNPs were used for further investigation. To identify potential functional genes under selection, we annotated the overlapping candidate genes detected by the different summary statistics. After the critical regions under selection were determined, the Cattle QTL database, PANTHER 17.0, and NCBI databases were used for annotation, with genes annotated against the Bos_taurus_UMD_3.1.1 assembly. Prominent genes and GO terms undergoing selective sweeps are highlighted in the following sections. Four approaches were used in this study: the integrated haplotype score (iHS), runs of homozygosity (ROH), Tajima's D, and the composite likelihood ratio (CLR).
1.5. Tajima’s D
Tajima's D is a widely used statistic in population genetics. It compares the mean number of pairwise differences with the number of segregating sites in nucleotide polymorphism data, and is a standard method for detecting selection signatures (Tajima, 1989). The test statistic is approximately zero under neutral variation, positive when intermediate-frequency polymorphisms are overrepresented, as under balancing selection for multiple alleles, and negative when rare (low-frequency) variants are in excess, as after a recent selective sweep (Rajawat et al., 2022). Tajima's D was estimated with VCFtools using a sliding-window approach with a window size of 500 kb (Danecek et al., 2011). The bottom 1% of empirical Tajima's D values were chosen as candidate genomic regions indicating a selective sweep.
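The statistic itself can be computed from scratch for a single window. This is an illustrative sketch of the quantity VCFtools reports, following Tajima (1989); the input encoding (one 0/1 string per sampled chromosome) is an assumption for the example:

```python
import itertools, math

def tajimas_d(haplotypes):
    """Tajima's D for one window of phased haplotypes (Tajima, 1989).

    haplotypes: equal-length 0/1 sequences, one per sampled chromosome.
    A from-scratch sketch for illustration, not the VCFtools code.
    """
    n = len(haplotypes)
    # S: number of segregating sites in the window
    S = sum(1 for site in zip(*haplotypes) if len(set(site)) > 1)
    if S == 0:
        return 0.0
    # pi: mean number of pairwise differences between haplotypes
    pairs = list(itertools.combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(h1, h2))
             for h1, h2 in pairs) / len(pairs)
    # Tajima's normalising constants
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # D = (pi - theta_W) / sqrt(Var), with theta_W = S / a1
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))
```

An excess of singletons drives the numerator negative, while intermediate-frequency variants drive it positive, matching the interpretation given above.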
1.6. ROH
Long stretches of homozygous genotypes, known as runs of homozygosity (ROH), are thought to reflect segments identical by descent arising from processes such as consanguinity, reduction in population size, and natural selection (Szpiech et al., 2013). ROH were identified with a sliding-window approach implemented in the detectRUNS R package, using the following threshold parameters: (i) maximum number of heterozygous SNPs per window (maxOppWindow = 1); (ii) minimum number of SNPs (minSNP = 20); and (iii) maximum number of missing calls per window (maxMissWindow = 1). The ROH estimation threshold was set at 0.6, and SNPs passing this threshold were chosen as candidates for a selective sweep (Almeida et al., 2019).
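The window-level test behind these parameters can be sketched as follows. This is a simplified illustration of the sliding-window scan, assuming a per-animal genotype vector; it is not the detectRUNS algorithm itself (which additionally converts window flags into per-SNP proportions against the 0.6 threshold):

```python
def roh_windows(genotypes, min_snp=20, max_opp=1, max_miss=1):
    """Flag sliding windows consistent with a run of homozygosity.

    genotypes: one animal's calls in map order, coded 'AA'/'BB'
    (homozygous), 'AB' (heterozygous) or None (missing). Defaults mirror
    the thresholds in the text (minSNP = 20, maxOppWindow = 1,
    maxMissWindow = 1). Simplified sketch, not the package's exact code.
    """
    flags = []
    for start in range(len(genotypes) - min_snp + 1):
        window = genotypes[start:start + min_snp]
        n_het = sum(1 for g in window if g == 'AB')
        n_miss = sum(1 for g in window if g is None)
        # a window supports an ROH if it has at most max_opp heterozygous
        # calls and at most max_miss missing calls
        flags.append(n_het <= max_opp and n_miss <= max_miss)
    return flags
```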
1.7. The integrated haplotype score (iHS)
This approach compares alleles within the same population. As a representative of haplotype-based methods, iHS was calculated for each SNP using the R package 'rehh' (Voight et al., 2006; Gautier and Vitalis, 2012). The iHS can identify incomplete or ongoing selection signatures (Sabeti et al., 2007). The method, developed by Voight et al. (2006), extends EHH by comparing EHH between the derived and ancestral alleles within a population. We calculated |iHS| for each SNP across the genome of the breeds to find genomic regions that might have been targets of recent selection. Beagle v5.1 was first used for chromosome-wise haplotype phasing and imputation of missing genotypes for the iHS investigation (Rajawat et al., 2022). iHS values were then calculated from a data frame of iHH values for the ancestral and derived alleles using the rehh package in R v3.2.2 (Gautier and Vitalis, 2012). The top 1% of regions were chosen for annotation.
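The EHH quantity underlying iHS can be illustrated directly. The sketch below computes EHH for one core allele at one flanking site; integrating EHH over distance gives iHH, and the log-ratio of ancestral to derived iHH gives the unstandardised iHS. This is a conceptual illustration under assumed inputs, not the rehh implementation:

```python
import itertools

def ehh(haplotypes, core_index, core_allele, site):
    """Extended haplotype homozygosity (EHH) from a core SNP to `site`.

    haplotypes: list of 0/1 tuples (phased chromosomes). EHH is the
    probability that two randomly chosen haplotypes carrying
    `core_allele` at `core_index` are identical over the whole interval
    between the core SNP and `site`. Illustrative sketch only; rehh
    computes this (and its integral, iHH) internally.
    """
    carriers = [h for h in haplotypes if h[core_index] == core_allele]
    if len(carriers) < 2:
        return 0.0
    lo, hi = min(core_index, site), max(core_index, site)
    pairs = list(itertools.combinations(carriers, 2))
    # count carrier pairs whose haplotypes match across the interval
    same = sum(1 for h1, h2 in pairs if h1[lo:hi + 1] == h2[lo:hi + 1])
    return same / len(pairs)
```

EHH starts at 1 at the core SNP and decays with distance; an allele under recent selection keeps EHH high over long ranges, which is what inflates its iHH relative to the other allele.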
1.8. Composite Likelihood Ratio (CLR)
A selective sweep alters allele frequency patterns at linked sites, eliminating variation at tightly linked loci and producing an excess of alleles at very low and very high frequencies in more distant regions (Williamson et al., 2007). The CLR test (Nielsen et al., 2005) was used as a statistical approach to detect this characteristic pattern of allele frequencies along a chromosome following a selective sweep. For gene annotation, a region of 500 bp was then taken (Wang et al., 2018).