Sample Collection and Genomic DNA Extraction
Total genomic DNA was isolated from soft tissues (muscle, heart, liver, and kidney) and blood samples, depending on the availability of material collected by the Mammal Research Institute, Polish Academy of Sciences in Białowieża between 1990 and 2016. Before DNA extraction, the samples (blood, liver, muscle, and skin) were preserved at -20°C. A total of 150 European bison (126 males, 24 females) were included in this study: 84 cases and 66 controls, comprising 41 non-affected males and 79 males with confirmed posthitis; the remainder were males of undefined disease status and females. DNA was extracted using the following commercial total DNA isolation kits, according to the manufacturers' guidelines: the Syngen DNA Mini Kit (spin-column protocol; Syngen, Wrocław, Poland), the Qiagen DNeasy® Blood & Tissue Kit (spin-column protocol), and the Sherlock AX Kit (DNA-precipitation protocol; A&A Biotechnology, Gdańsk, Poland). Because most of the available material consisted of blood samples, DNA from blood was obtained using the phenol-chloroform extraction method.
Enrichment Library Preparation and High-Throughput Sequencing (SureSelectXT Target Enrichment System)
In this project, we studied genes and regions previously suggested to be significantly associated with posthitis [2], as well as other specific regions that might be associated with the disease [32, 39, 40]. A total of 74 regions corresponding to 74 genes (Supplementary Table S2) were captured, distributed across chromosomes 1, 9, 12, 13, 15, 23, 25, 26, 29, and X. The custom capture design falls within the 3-5.9 Mb target-size tier (SureSelectXT Custom Target Enrichment Kit, Agilent). Library construction and sequencing were performed according to the manufacturer's protocols. High-throughput sequencing was performed on an Illumina HiSeq 4000 platform using 100 bp paired-end runs at BGI (www.bgi.com).
Read alignment, variant calling and filtering
All reads were adapter-trimmed and quality-filtered with default parameters using Trim Galore [49] (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). Read quality was checked before and after trimming using the FastQC tool [50], and quality-check reports were generated for all FASTQ files. The filtered reads were then mapped onto the Bos taurus reference genome (UMD3.1) using BWA-MEM version 0.7.17 [51, 52], producing a binary alignment (BAM) file for each sample. Read-group information was added with the BWA-MEM program during the alignment step.
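The trimming and alignment steps above can be sketched as shell commands. This is an illustrative sketch only: file names, thread counts, and read-group fields are hypothetical placeholders, not the exact values used in the study.

```shell
# QC reports before trimming
fastqc sample_R1.fastq.gz sample_R2.fastq.gz

# Adapter/quality trimming with Trim Galore (default parameters);
# --fastqc regenerates QC reports on the trimmed reads
trim_galore --paired --fastqc sample_R1.fastq.gz sample_R2.fastq.gz

# Align to the Bos taurus UMD3.1 reference with BWA-MEM,
# embedding read-group information via -R (placeholder fields)
bwa index umd3.1.fa
bwa mem -t 8 -R '@RG\tID:sample1\tSM:sample1\tLB:lib1\tPL:ILLUMINA' \
    umd3.1.fa sample_R1_val_1.fq.gz sample_R2_val_2.fq.gz \
  | samtools view -b -o sample1.bam -
```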
Data pre-processing and variant discovery [53, 54, 55] were performed following the GATK Best Practices (https://software.broadinstitute.org/gatk/best-practices/) using GATK [54]. Aligned reads were sorted by query name using SortSam. The GATK FixMateInformation program was used to ensure consistent mate information for both reads in each pair in the query-name-sorted alignment file; the SortSam module was then used to re-sort the output by genomic coordinate. Duplicate reads were marked using GATK's MarkDuplicates, and SetNmMdAndUqTags was used to fix the NM, MD, and UQ tags. The final BAM file for each sample was verified using GATK's ValidateSamFile.
After pre-processing, GATK's HaplotypeCaller [54] was run on each individual BAM file in GVCF (genomic variant call format) mode to produce a separate gVCF file per individual, restricted to the targeted regions (genomic intervals) with 100 bp interval padding. The interval list, corresponding to the genomic regions targeted during library preparation, was used in the variant-calling steps; such a list may also be provided by the kit manufacturer. For targeted or exome sequencing approaches, the GATK documentation recommends adding interval padding, usually 100 bp (https://gatk.broadinstitute.org/). GATK's CombineGVCFs was used to merge the per-sample gVCF files into a single gVCF, and GenotypeGVCFs was then used to convert this gVCF into a VCF file containing all raw variants for downstream analysis. GATK's SelectVariants module was used to extract the raw SNPs and indels from the combined VCF file, and these were then subjected to hard filtering. Hard filtering discards variants that fall below specific thresholds for properties such as variant confidence, root mean square of the mapping quality, and strand bias.
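The pre-processing and variant-discovery steps described above can be sketched as GATK4 commands. File names and the two-sample merge are illustrative placeholders; the actual run covered all samples.

```shell
# Pre-processing per sample (placeholder file names)
gatk SortSam -I sample1.bam -O sample1.qsort.bam -SO queryname
gatk FixMateInformation -I sample1.qsort.bam -O sample1.fixmate.bam
gatk SortSam -I sample1.fixmate.bam -O sample1.csort.bam -SO coordinate
gatk MarkDuplicates -I sample1.csort.bam -O sample1.dedup.bam -M sample1.dup_metrics.txt
gatk SetNmMdAndUqTags -I sample1.dedup.bam -O sample1.final.bam -R umd3.1.fa
gatk ValidateSamFile -I sample1.final.bam

# Per-sample gVCF calling restricted to the capture targets, with 100 bp padding
gatk HaplotypeCaller -R umd3.1.fa -I sample1.final.bam -ERC GVCF \
    -L targets.interval_list -ip 100 -O sample1.g.vcf.gz

# Joint genotyping across samples, then splitting SNPs and indels
gatk CombineGVCFs -R umd3.1.fa -V sample1.g.vcf.gz -V sample2.g.vcf.gz -O cohort.g.vcf.gz
gatk GenotypeGVCFs -R umd3.1.fa -V cohort.g.vcf.gz -O cohort.vcf.gz
gatk SelectVariants -V cohort.vcf.gz --select-type-to-include SNP -O cohort.snps.vcf.gz
gatk SelectVariants -V cohort.vcf.gz --select-type-to-include INDEL -O cohort.indels.vcf.gz
```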
The variant annotations used for filtering characterize low-level properties of variants derived from information in the BAM file: quality (QUAL), depth (DP), Fisher strand bias (FS), root mean square of the mapping quality of reads supporting a variant call (MQ), quality by depth (QD), and the Mann-Whitney-Wilcoxon rank-sum tests MQRankSum, ReadPosRankSum, BaseQRankSum, and ClippingRankSum.
GATK's VariantFiltration module was used to select and filter high-quality SNPs (-filter-name "QD2" -filter "QD < 2.0" -filter-name "FS60" -filter "FS > 60.0" -filter-name "MQ40" -filter "MQ < 40.0" -filter-name "SOR3" -filter "SOR > 3.0" -filter-name "MQRankSum-12.5" -filter "MQRankSum < -12.5" -filter-name "ReadPosRankSum-8" -filter "ReadPosRankSum < -8.0" -filter-name "QUAL30" -filter "QUAL < 30.0"); indels were filtered analogously.
After hard filtering, variants not flagged as "PASS" by GATK were excluded from downstream analyses using BCFtools [56]. The resulting VCF file was used for GWAS analysis in PLINK 1.9 [57, 58] and for variant annotation with Ensembl's Variant Effect Predictor (VEP) tool.
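The SNP hard filters and the subsequent PASS-only selection can be sketched as follows; file names are hypothetical placeholders, and the thresholds are the ones listed above.

```shell
# Apply the hard-filter thresholds to the SNP call set
gatk VariantFiltration -V cohort.snps.vcf.gz \
    --filter-name "QD2"              --filter-expression "QD < 2.0" \
    --filter-name "FS60"             --filter-expression "FS > 60.0" \
    --filter-name "MQ40"             --filter-expression "MQ < 40.0" \
    --filter-name "SOR3"             --filter-expression "SOR > 3.0" \
    --filter-name "MQRankSum-12.5"   --filter-expression "MQRankSum < -12.5" \
    --filter-name "ReadPosRankSum-8" --filter-expression "ReadPosRankSum < -8.0" \
    --filter-name "QUAL30"           --filter-expression "QUAL < 30.0" \
    -O cohort.snps.filtered.vcf.gz

# Keep only records flagged PASS
bcftools view -f PASS -O z -o cohort.snps.pass.vcf.gz cohort.snps.filtered.vcf.gz
```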
Genome-wide association analysis (GWAS)
Before converting the final VCF file to PLINK 1.9 format, only bi-allelic single nucleotide polymorphism (SNP) variants were retained using BCFtools, as PLINK cannot utilize multi-allelic SNPs. The filtered VCF file, containing only QC-passing variants, was converted into PLINK binary-format files (*.bed, *.bim, *.fam) using PLINK. PLINK was then used to retain only male bison, and GWAS was performed on this subset. A principal component analysis was also performed and plotted.
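These conversion and subsetting steps can be sketched as follows; file names are hypothetical, and the male-only step assumes sex is coded in the .fam file.

```shell
# Retain biallelic SNPs only (PLINK cannot use multi-allelic sites)
bcftools view -m2 -M2 -v snps -O z -o cohort.biallelic.vcf.gz cohort.snps.pass.vcf.gz

# Convert to PLINK binary format (.bed/.bim/.fam)
plink --vcf cohort.biallelic.vcf.gz --allow-extra-chr --make-bed --out bison

# Keep male animals only (requires sex information in the .fam file)
plink --bfile bison --filter-males --make-bed --out bison_males
```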
Statistical Analysis
PLINK was used to perform quality control with several parameters (--mind 0.1 --geno 0.1 --maf 0.05 --hwe 1e-6). Before the association analysis, we excluded SNPs with a minor allele frequency below 5% (MAF < 0.05), more than 10% missing genotypes, or a Hardy-Weinberg equilibrium (HWE) p-value < 1e-6, as well as individuals with more than 10% missing genotypes.
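Combined into a single PLINK invocation, the quality-control thresholds above look as follows (file names are placeholders):

```shell
# --mind: individual missingness; --geno: SNP missingness;
# --maf: minor allele frequency; --hwe: Hardy-Weinberg p-value threshold
plink --bfile bison_males --mind 0.1 --geno 0.1 --maf 0.05 --hwe 1e-6 \
      --make-bed --out bison_qc
```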
GWAS was conducted using logistic regression under an additive genetic model, implemented in PLINK 1.9, with a 95% confidence interval (CI). For --model and case/control --assoc analyses, '--ci X' causes a size-X centred CI to be reported for the odds ratio (e.g., "--ci 0.95" corresponds to a 95% confidence interval).
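A minimal sketch of the association command (placeholder file names; PLINK's --logistic defaults to an additive model):

```shell
# Case/control logistic regression with 95% CIs reported for the odds ratio
plink --bfile bison_qc --logistic --ci 0.95 --out gwas_posthitis
```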
The Bonferroni correction has been used to adjust for multiple testing in several studies [59, 60, 61]; the genome-wide significance level is then calculated as p < 0.05/n, where n denotes the number of tested single nucleotide polymorphisms (SNPs). Owing to the high level of inbreeding and the limited sample size, a Bonferroni-corrected genome-wide significance level was too conservative to be applied in our case; hence, SNPs with p < 0.005 and OR > 1 were considered significantly associated with the trait [62]. We used the odds ratio (OR), understood as the ratio between the odds of an event occurring in the presence versus the absence of a given variable, as the measure of association [63].
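As a worked example of the Bonferroni threshold arithmetic, with a hypothetical SNP count (the actual n is determined by the filtered dataset):

```shell
# Hypothetical post-QC SNP count; the real n comes from the data
n=12000
# Bonferroni-style genome-wide significance threshold: 0.05 / n
awk -v n="$n" 'BEGIN { printf "%.2e\n", 0.05 / n }'   # prints 4.17e-06 for n = 12000
```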
Annotation of Variants
The Variant Effect Predictor (VEP) tool [36] was used to annotate the genetic variants; it is the most widely used tool for annotating genomic variation found in high-throughput sequencing data. BCFtools was used to retain only biallelic variants from the VCF file generated after hard filtering, and variant annotation was performed on the resulting VCF file both with and without additional variant filtering. All 150 samples (both male and female) were used for variant annotation. The default Bos taurus genome assembly (ARS-UCD 2.1) was selected for annotation in VEP.
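A minimal command-line sketch of the VEP step, assuming a locally installed annotation cache (file names are placeholders):

```shell
# Annotate the biallelic variant set with Ensembl VEP using the cached
# Bos taurus annotation (assumes the cache has been downloaded beforehand)
vep -i cohort.biallelic.vcf.gz -o annotated.txt \
    --species bos_taurus --cache --force_overwrite
```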
VCFtools [53] was used to implement additional variant filtering with the following parameters: mean read depth between 10 and 60 (--min-meanDP 10 --max-meanDP 60); only biallelic variants, by setting the minimum and maximum numbers of alleles to two (--min-alleles 2 --max-alleles 2); exclusion of sites with more than 20% missing data across all samples (--max-missing 0.8); exclusion of SNPs with quality scores below 20 (--minQ 20); and a minor allele frequency threshold of 5% (--maf 0.05).
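Combined into one VCFtools invocation, with placeholder file names:

```shell
# Additional site-level filtering; flags as listed in the text
vcftools --gzvcf cohort.snps.pass.vcf.gz \
    --min-meanDP 10 --max-meanDP 60 \
    --min-alleles 2 --max-alleles 2 \
    --max-missing 0.8 --minQ 20 --maf 0.05 \
    --recode --recode-INFO-all --out cohort.addfilt
```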
Principal Component Analysis (PCA)
Principal component analysis (PCA) was carried out on the quality-filtered dataset using PLINK 1.9's --pca command, producing output files with the first 20 ordered eigenvectors and the corresponding eigenvalues. The first two components were used for subsequent analysis and visualization.
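The PCA step can be sketched as follows (placeholder file names):

```shell
# First 20 principal components from the QC-filtered genotypes
plink --bfile bison_qc --pca 20 --out bison_pca
# bison_pca.eigenvec holds per-sample PCs; bison_pca.eigenval the eigenvalues
```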
Juxtaposition of SNPs Obtained from Target Enrichment, Reference Pipeline, and Bovine High-Density SNP Chip Tool
We compared the number of variants obtained from the different pipelines and checked the uniqueness of the variants identified in our previous work [38] against the Bovine High-Density SNP Chip Tool.