Resource population and phenotypes collection
Simmental beef cattle (born between 2008-2015) were established in Ulgai, Xilingole League, Inner Mongolia of China. Six body size traits at three growth stages (6, 12, and 18 months after birth) were measured simultaneously for each individual. The details of trait measurement are as follows: BH, also named wither height, the height from wither to ground; BL, the length from the front edge of the scapula to the back edge of the ischial tuberosity; HH, height from hip to ground; HS, also named heart girth, the chest girth in the back edge of scapula; AS, the girth of the thickest part of the abdomen; CS, the girth of the thinnest cannon bone. In addition, only cattle (133 individuals, Table S1) whose phenotype records overlapped during three growth stages were used for LONG-GWAS analysis. The blood sample specimens were collected during the regular quarantine inspection of the farms was conducted. All procedures were conducted in strict compliance with the guidelines established by the Ministry of Agriculture of China. Some descriptive statistics and heritability estimates of six traits at three growth stages are presented in Table 1.
Genotyping and quality control
Genomic DNA was isolated from blood samples using the TIANamp Blood DNA Kit (Tiangen Biotech Co.Ltd.,Beijing, China). DNA quality was acceptable when the A260/A280 radio was between 1.8 and 2.0. Genotyping was performed with the Illumina BovineHD Beadchip (Illumina Inc., San Diego, CA, USA) and the PLINK v1.07 Software was used for quality control . In this study, animals with a call rate (0.9) were discarded. SNPs were deleted the following standards, including minor allele frequency (0.01), SNP call rate (0.05) and Hardy-Weinberg equilibrium values (p110−6). Finally, 671192 SNPs on 29 autosomal chromosomes with an average distance of 3 kb were generated for the analysis.
For six body size traits in three growth stage, we performed single-trait GWAS, respectively. The compressed mixed linear model (CMLM) was called because it reduced computing time by clustering individuals into groups, increased the power in QTN detection by eliminating the need to re-compute variance components, and enhanced the effectiveness in correcting the inflation from the polygenic background and controlling the bias of population stratification [61, 62]. Briefly, a principal components analysis (PCA) was performed and a kinship matrix was calculated using the Genome Association and Prediction Integrated Tool (GAPIT) package in R v3.4.2 . To revise the effects of population structure, the Q matrix was reflected by the PCA. To replace the incomplete pedigrees, the K matrix was calculated by the VanRaden algorithm . This model is as follows:
See formula 1 in supplementary files.
where y is a vector of the observed phenotypes; W was a vector of SNP genotype indicators, which was coded as 0, 1 and 2 corresponding to the three genotypes AA, AB, and BB with B being the minor allele. υ was the effect of marker, which is treated as a fixed effect; Variable is an incidence matrix for non-genetic fixed effects, and is a non-genetic vector of fixed effects including month ages (time of birth to measurement), enter weight (weight of just entering the farm), fattening days (time of entering farm to measurement) and principal component effects (the top three eigenvectors of the Q matrix). Variable Z is an incidence matrix for a vector of polygenic effects, and parameter u is a vector for residual polygenic effects with an assumed N (0, Kσ2) distribution, where σ2 is the polygenic variance and K is a marker inferred kinship matrix. While е is a vector for random residual errors with a putative N (0, I) distribution, where is the residual variance. The heritability () is defined as: =. The CMLM analysis was performed with GAPIT Software package (http://www.maizegenetics.net/gapit). Quantile–quantile (Q–Q) plots were generated to visualize the goodness of fitting for the GWAS model accounted by the population structure and familial relatedness. The negative logarithm of the p value from the model was calculated against the expected value based on the null hypothesis. The threshold p value after Bonferroni correction was 0.05/N = 7.45×10-8, where N is the number of SNPs. In light of the fact that the Bonferroni correction results were too stringent with low statistical power . Hence, we adopted the false discovery rate (FDR) to determine the threshold values for Single-GWAS, Multi-trait GWAS and LONG-GWAS. The FDR was set as 0.01, and the threshold p value was calculated as follows:
See formula 2 in supplementary files.
where n is the number of P<0.01 in the results, and m is the total number of SNPs .
For the three growth stages, the multi-trait GWAS were conducted to detect pleiotropic SNPs and the model was a Chi square statistic, which approximately followed a Chi square distribution with the number of traits tested as the number of degrees of freedom. It was calculated for each SNP using the following formula :
See formula 3 in supplementary files.
where is the estimate of v and the corresponding variance can be obtained by the compressed mixed linear model (CMLM); While is the 6×1 vector of the signed t-values of the ith SNP from the above-mentioned single-trait GWAS for the six traits. Matrix is the transpose of the vector, and is the inverse of the 6×6 correlation matrix between traits, which was calculated by the estimated effects of the qualified SNPs (signed t values).
For these individuals that completely obtained body size information at three growth stages, we conducted a longitudinal GWAS by LONG-GWAS . This model was similar to CMLM, except that the phenotypic variance was partitioned to SNPs, fixed factors (the above-mentioned vector), polygenic effects, time stage effects and residual variance. Moreover, numerous studies have reported that the longitudinal design could facilitate the identification of time-dependent and consistent loci, which increased the statistical power due to their effectiveness in incorporating the correlation structure of multiple measurements and alleviating the multiple testing burden [26, 27, 69]. The code data implementing this method may be found at http://genetics.cs.ucla.edu/longGWAS/. This model is as follows:
See formula 4 in supplementary files.
In this formula, is the adjusted phenotype (these fixed effects mentioned in CMLM were adjusted in order to account for additional confounding). The incidence matrix W, Z, the vector v, u and e are consistent with CMLM mentioned above. Differently, the parameter is a vector for time stage effects with a putative N (0,) distribution, where D is a known block diagonal matrix representing the covariance between permanent environmental components. The D matrix was calculated by this formula: D= E ⊗ I, where E is a 3 × 3 matrix representing the covariance between the set of 3 time points for each individual.
Identification and annotation of candidate genes
The UMD3.1 genome assembly was used to located genes for annotation, and the QTLdb database (http://www.animalgenome.org) was applied to search for related QTL regions.