Differences in the quality of different varieties of mutton
There were certain differences in the nutritional components of the longissimus dorsi and leg muscles of the three sheep populations, as shown in Table 1. The collagen in the longissimus dorsi and leg muscles of the XQ population was significantly higher than in the other two populations (P<0.05). In the longissimus dorsi, the crude fat content of the XQ population was significantly higher than the other two populations (P<0.05), but the CHR population had the highest crude fat content in the leg muscles. It can be seen that whether in the longissimus dorsi or the semitendinosus, the muscle water content of the WZMQ population is higher (P>0.05). The protein content in the leg muscles of sheep had no significant difference between different sheep (P>0.05), while in the longissimus dorsi, the XQ population was significantly lower than the other two populations (P<0.05).
Sample whole-genome sequencing quality control and reference comparison
The raw data generated by sequencing was removed from the joints. The proportion of effective reads obtained by removing N bases and low quality is greater than 97%, indicating that the sequencing quality is good and there is a large amount of data available for subsequent analysis. The proportion of bases with quality values greater than or equal to 20 (Q20) in all samples is more than 96%; the proportion of bases with quality values greater than or equal to 30 (Q30) is between 92% and 95%, indicating that the quality of the sample data is excellent. The proportion of the total number of reads aligned to the reference genome for each sample was greater than 98%, indicating that the sequencing data was not disturbed by other data. The sequencing data were aligned to the consensus sequence obtained by the reads clustering, and the statistical results of the alignment rate of the samples are shown in Table 2. Since the reference used is the consensus sequence obtained by reads clustering, there will be some discrepancies in the alignment rate between different samples. In addition, the proportion of the total reads aligned to the reference genome in each sample was greater than 98%, indicating that the sequencing data was not contaminated by other data.
SNP and InDel statistics
SNP Statistical Results
SNP (Single Nucleotide Polymorphisms) refers to the genetic markers formed by the variation of a single nucleotide on the genome, which is numerous in number and rich in polymorphisms9. Variations of single nucleotides on the genome include substitutions, deletions and insertions. The ratio of conversion and transversion in the same species is the same. Statistics on the filtered SNPs, the results are as follows Table S1.
The distribution of SNPs in different regions of the genome varies in proportion (Fig. 2), thus affecting the function of certain genes. The effect on intergenic reached 19.01%, but had no effect on transcribed proteins; 38.58% of SNPs affected the introns of the sample genome, forming some intron variants; they had similar effects on the sample transcripts. More than half of the SNPs occurred in non-protein-coding regions, emphasizing that the SNPs that occurred in exons were less likely. For the obtained SNP data, further processing was performed, and filtering was performed according to MAF>0.05 and data integrity>0.8. SNPs with biallelic polymorphisms were retained. The SNPs obtained by the final screening enter into the subsequent analysis.
InDel Statistical Results
InDel (Insertion-Deletion) refers to the insertion and deletion of small fragments in the sample relative to the reference genome, which may contain one or more bases10. According to the position of InDel in the genome, it can be divided into InDel of coding and non-coding region sequence11. The occurrence of InDel in the coding sequence is related to the encoded protein function and amino acid site. If one or several bases (not multiples of 3) are inserted or deleted in the DNA coding sequence, the mutation is called frameshift mutation12. This kind of mutation will cause all changes in the DNA coding frame downstream of the insertion point or deletion point, and as a result, the amino acid sequence after the mutation point will be changed.
As can be seen from Fig. 3, 38.44% of InDel occurred in the sample transcript, which had an impact on the sample transcription and translation. The InDel occurring in the non-coding region will reduce the efficiency of transcription and the accuracy of splicing. Fig. 3 shows that more than half of the InDel occurs in the non-coding region of the genome, which has little effect on the appearance traits of the sample.
Genetic population structure of the sheep individuals
Principal components analysis (PCA) was performed based on the SNP, and the principal component clustering of all samples was obtained, as shown in Fig. 4a. In the PCA score plot, three principal components (PC1, PC2 and PC3) were extracted to be 7.53%, 2.96% and 2.89%, respectively. From the analysis in Fig. 4a, it can be concluded that WZMQ is tightly clustered, XQ is more dispersed but better than CHR. The PCA results showed that the three kinds of sheep could be completely separated, and there was no crossover phenomenon. Clustering with PC1 and PC2 showed better separation among the three samples, as was clustering with PC2 and PC3. When clustering with PC1 and PC3, it was found that the distance between XQ and WZMQ was relatively close, and CHR was completely separated from them.
The evolutionary history of individuals was inferred with the neighbor-joining (NJ) tree (Fig. 4b). The phylogenetic tree shows that the CHR population samples are clustered together individually. In a population, two loci on the same chromosome will be linked, that is, the genotypes of these two loci in the population are not in a random combination state, which is called "linkage disequilibrium" (LD). LD analysis was performed between different SNPs, and the linkage disequilibrium coefficient (r2) was calculated. It can be seen from the LD-decay decay diagram (Fig. 4c) that the LD of the CHR sample population decays slowly, indicating that the group formation time is short, the linkage exchange between individuals is insufficient, and the kinship relationship is relatively close; the LD decay of the WZMQ group is faster, indicating that this group is formed relatively ancient. The LD-decay decay map once again proves that The distance between CHR and the other two sheep is farther, and the distance between WZMQ and XQ is short. In addition to thisFor SNPs up to 50 kb apart, the average r2 values were equal to 0.093 (CHR), 0.082 (WZMQ), 0.083 (XQ). Further details about the LD analysis using r2 are included in supplementary Table S2. The result indicated that the LD decay tends to be stable when the distance is 100 kb. Therefore, genes located within ±50kb near important SNP sites can be listed as candidate genes in the follow-up study.
This experiment performed unsupervised cluster analysis using ADMIXTURE. In the analysis of population genetic structure, different K values represent the assumption that there are K ancestral groups (K=1~10). The analysis shows that when K=3, the three groups of samples have the best clustering. All three populations in the experiment had relatively uniform genetic components (Fig. 4d).
Fst analysis
The population fixed coefficient Fst reflects the level of population allelic hetero-zygosity. Fst was a basic indicator for traditionally measuring genetic differentiation among populations. Understanding population structure and genetic background in an evolutionary context13–15. The corresponding gene loci and their functions were found by analyzing the selected regions with the Fst value in the top 1%. Finally, the differential genes between different groups are found and the molecules are marked.
The three groups of samples were subjected to two-to-two Fst analysis to obtain the following Fig. S1. The highest Fst value between the CHR and WZMQ populations is located on the NC_019470.2 chromosome, indicating that the genetic differentiation between the two populations is relatively large. This region has undergone a strong selection and has more differential loci. The comprehensive analysis of Fig. S1 shows that the CHR population was more genetically differentiated than the other two populations. The Fst value between WZMQ and XQ populations was lower than 0.25. The degree of genetic differentiation in both the WZMQ and XQ populations was small.
Functional annotation of GO and KEGG genes in the screened region
CHR&WZMQ&XQ
The GO database divides gene functions into three parts: cellular component, molecular function, and biological process. The KEGG database is not only a functional annotation of genes themselves but also a database related to pathways. The final result obtained by GO and KEGG is the integrated macro result16. To detect candidate genes for wool traits by resequencing Chinese fine-wool sheep, a genome-wide association study (GWAS) was performed to detect candidate genes for eight wool traits. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment results revealed that many important pathways were associated with keratin and cell proliferation and differentiation17.
Regions were screened with the 99% threshold of Fst value between WZMQ and CHR, and genes contained in all regions were subjected to GO annotation analysis. As a result, 519 genes were enriched into 50 GO entries. More than 80% of the gene numbers in Fig. S2a were enriched in the cytoplasm (nucleus). Among the GO entries, bio-logical processes were less enriched. KEGG pathway analysis was performed on the genes contained in the region (Fig. S2b). Most genes were annotated to pathways related to signaling, infectious diseases and the immune system. Two of these pathways were annotated as being related to the degradation and metabolism of xenobiotics.
Region screening was performed with the 99% threshold of Fst value between XQ and CHR. GO annotation was performed on the genes contained in the region. As a result, 396 genes were enriched in 50 GO items (Fig. S3a), of which the number of genes enriched in 25 biological processes didn’t exceed 30%. Among these GO entries, the most enriched genes were cytoplasm. There were some differences in gene GO annotations between CHR and the other two groups of populations (Fig. S1a and Fig. S2a). In terms of molecular function between the CHR population and the other two populations, the genes contained in the different regions were mostly enriched in the combined sub-entry. In the biological process, they were mostly enriched in transcription-translation-related entries. The KEGG pathway analysis of the genes contained in the region showed that (Fig. S2b) about 15% of the genes were enriched in the pathways related to signal transduction; about 9% of the genes were enriched in the pathways related to the immune system. The results showed that the difference region between the two populations of CHR and XQ contained more genes than the samples were related to environmental adaptability. Among them, there were 10 KEGG pathway categories related to body metabolism.
WZMQ&XQ
Regions were screened with the 99% threshold of Fst value between WZMQ and XQ. GO annotations were performed on the genes contained in the regions. The results showed that 405 genes were annotated in 50 GO items (Fig. S3a). About 79% of the gene GO annotations were related to the cytoplasm. Then the KEGG pathway analysis of the genes contained in the region showed that 39 pathways were enriched in the Signal transduction category, indicating that most of the genes screened in the region were related to the body's signal transmission. The KEGG pathways related to body diseases are more enriched (Fig. S3b), comparing WZMQ with the other two sheep breeds, it can be found that a large part of the genes screened by region is enriched in the pathway categories related to sheep diseases. There are 58 pathways enriched in body metabolism, of which 12 pathways are enriched in the category of Carbohydrate metabolism.
The GO annotation map of the genes in the regional screening between WZMQ and the other two sheep breeds shows that most of the GO annotations of the genes in the three sheep regions are related to cellular components (Fig. S1a, S2a and S3a). The number of pathways enriched in the immune system and endocrine system of WZMQ, XQ and CHR were significantly higher than in other systems (Fig. S1b, S2b and S3b), In terms of body metabolism, the number of enriched pathways between XQ and WZMQ is lower than that of CHR and WZMQ.
Construction of three kinds of sheep differential loci
Genomes of different individuals were sequenced by whole-genome resequencing of sheep of the known genome sequence. Through sequence alignment, a large number of SNP, InDel and other variation information can be found. In turn, it reflects the differences between different sheep individuals or groups at the genome-wide level. As shown in Table S3, in the five sheep gene fragments corresponding to chromosomes NC_019470.2 and NC_019474.2, the CHR population is completely different from the other two populations in base types. Several differential loci shown in Table S3 can completely distinguish CHR from the other two populations, but cannot distinguish the two populations of WZMQ and XQ. It is necessary to combine more differential sites to achieve the purpose of differentiation.
In the two sheep gene fragments corresponding to chromosome NC_019484.2, the bases of WZMQ and XQ populations are completely different at several sites on this chromosome (Table S4). Finally, the two populations of WZMQ and XQ can be distinguished by these loci. According to Table S3 and Table S4, a schematic diagram like Fig. 5 can be drawn. It can be seen that the three types of sheep can be distinguished by the differences in the three chromosomes.