The ability to identify chicken breeds or brands on the market at the genetic level could increase consumer trust. Previously used mtDNA sequence variation and MS markers remain useful for verifying breeds. However, establishing an automated verification system based on these methods takes a long time, and an experienced operator with analytical skills is also required (Oh et al. 2014; Guo et al. 2017). SNP markers provide less variant information per marker than MS markers; however, a combination of several SNPs can provide sufficient information for classification. In addition, the cost of genotyping is continually falling, and customizable SNP genotyping platforms can be used as next-generation verification tools that can respond accurately and quickly to market demands.
However, identifying the minimum number of markers from a high-density SNP array for identification of a target population is not simple. In previous studies, independent SNPs determined by canonical discriminant analysis (CDA), the delta statistic, the F statistic, and PCA were used for genetic classification, and breed identification using low-density SNP arrays has also been demonstrated (Dimauro et al. 2013; Bertolini et al. 2017; Judge et al. 2017; Schiavo et al. 2020). In this study, using a 600K SNP genotyping array for chicken, three combinations of 96 SNP markers were selected based on the results of a GWAS and LD analysis, where the new chicken breeding stock (with HH, HF, and HY as the founder populations) was the case and the remaining chicken groups were the controls. The feature selection function was applied to SNPset3 to determine the minimum number of markers required for discrimination of the target group. The machine learning algorithms showed high discriminatory power (Figure 1).
Identification of target chicken population based on genetic components
New chicken breeding stocks produced by three-way crossing require a combination of shared markers that can clearly distinguish them from the other chicken populations. Twenty chicken populations were used in this study (Figure 4). Of these, twelve chicken populations (HH and HF, HG and HV, HS and HW, NC and ND, NS and NH, NR and NY) had a shared origin; therefore, a total of 14 populations were expected to represent independent chicken breeds. The HS, HW, NC, and ND lines all originated from Rhode Island Red (Seo et al. 2018), and the CC lines also shared part of their genetic components with them. Twelve genetic components could be used to determine the origins of the chicken populations; it was difficult to discriminate them using fewer marker genotypes.
The populations to be classified had HH, HF, and HY as their parental lines. It was difficult to distinguish HH, HF, and HY from the other chicken populations using a limited number of SNP markers. In terms of genetic distance, HH and HF were very close (0.09), but HY was relatively distant from those two breeds on the MDS plot (genetic distances of 0.25 and 0.27, respectively). The HY population was more closely related to the other chicken populations than to HH and HF. Therefore, it was difficult to identify a marker shared by all three founder populations. Breeds were first classified by population-specific alleles, as in the existing mtDNA and MS marker classification approaches. However, different results were obtained with the different marker combinations when the verification samples were added (data not shown), because the SNPs extracted from the array were not conserved in each population. In contrast, mutations in mtDNA or MS markers do not affect gene function: mtDNA markers are selected from mutations that reflect the maternal origin of the population, whereas MS markers use a population-specific allele as the identification point for classification. Thus, it was difficult to discriminate populations with a small number of samples using the population-specific SNPs.
GWAS and LD analysis for identification of the target population
Classification analysis was performed to overcome the limitations of the population-specific markers mentioned above. The HH, HF, and HY populations were set as the case group, and the remaining 17 populations were set as the control group. The 96 markers selected by the GWAS were strongly related to the case group. The case and control groups tended to form distinct clusters but were not clearly distinguished using only the GWAS-significant SNPs. Therefore, LD pruning was performed, and it was confirmed that SNPset3, which selected 50 SNPs per LD block, could clearly distinguish the two chicken groups because it removed the sharing of LD blocks between markers, or the relationship between adjacent LD blocks.
Regarding SNPset1, individuals with high genetic similarity had a high degree of clustering, and several samples overlapped in the MDS plot. When using SNPset2 and SNPset3, which selected SNPs based on the LD block, the clusters were separated by their relationships. When SNP markers were selected solely on their GWAS p-values, those strongly associated with the case group were affected by LD between nearby markers. It was confirmed that 95.8% of SNPset1 (92 of 96 SNPs) shared 39 LD blocks on GGA1 (Supplementary Figure 3). Additionally, 70 SNPs in SNPset2, and 37 in SNPset3, were located on GGA1. Using the AdaBoost model, which had excellent discriminatory power, only six SNPs were selected from GGA1. Thus, many SNPs on GGA1 were strongly related to the case group but provided redundant information and probably interfered with the classification of the two groups. Selection of SNPs in the case group based on GWAS analysis could increase the genetic distance between the case and control groups. It was difficult to distinguish the two groups based on the Fst, but it was confirmed that the genetic distance between the case and control groups was significantly increased (Figure 3). The optimum K-value in the admixture analysis decreased from 12 to 2 when the minimum marker combination selected by the AdaBoost model was used (Figure 4).
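The effect of LD pruning on a list of GWAS-ranked SNPs can be illustrated with a minimal sketch. It is not the procedure used to build SNPset2 or SNPset3: it assumes genotypes coded as 0/1/2, uses the squared Pearson correlation between genotype vectors as the r2 measure, and applies a hypothetical threshold of 0.5. The SNP names, genotypes, and p-values are invented for illustration.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as unlinked
    return cov * cov / (var_x * var_y)


def ld_prune(genotypes, p_values, r2_threshold=0.5):
    """Greedy pruning: visit SNPs from most to least significant and keep a
    SNP only if its r2 with every SNP already kept is below the threshold."""
    order = sorted(genotypes, key=lambda snp: p_values[snp])
    kept = []
    for snp in order:
        if all(r_squared(genotypes[snp], genotypes[k]) < r2_threshold for k in kept):
            kept.append(snp)
    return kept


# Hypothetical toy data: snp_b duplicates snp_a (perfect LD), snp_c is independent.
genotypes = {
    "snp_a": [0, 0, 1, 2, 2, 2],
    "snp_b": [0, 0, 1, 2, 2, 2],
    "snp_c": [2, 0, 1, 0, 2, 1],
}
p_values = {"snp_a": 1e-12, "snp_b": 1e-10, "snp_c": 1e-8}

pruned = ld_prune(genotypes, p_values)
print(pruned)
```

In this toy example the second most significant SNP is dropped because it carries the same information as the first, which mirrors the redundancy among strongly associated markers that share LD blocks.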
In previous studies, methods for selecting the minimum number of high-density SNP markers for breed identification using the delta statistic and Fst were reported. More than 300 and 591 breed-specific SNPs were selected by Judge et al. (2017) and Kumar et al. (2019), respectively. These relatively large numbers of SNPs were used to form a panel to discriminate among target breeds. Another study sought to identify the minimum number of markers needed for breed identification using the delta statistic, PCA, and an RF algorithm (Bertolini et al. 2017). Combinations of markers (48- and 96-SNP panels) capable of distinguishing among various cattle breeds were presented; efficient identification was possible with fewer markers than in previous studies.
Our GWAS and LD analyses were performed not to identify markers capable of distinguishing among all of the populations included in the study, but to distinguish only the target population from the others. It is therefore difficult to directly compare the results with those of previous studies. Comparing the Fst, the genetic distance, and the genetic components of the study populations before and after marker selection confirmed that the genetic distance and genetic composition calculated from the selected markers changed significantly for the target population, whereas the Fst did not. The genetic distances were calculated based on allele frequencies, and the results were similar to those obtained using the delta score. The explanatory power of the principal component in the PCA increased when using case-associated markers. The validation study results remained consistent after adding samples from other populations that were not used for marker selection. The 96 selected markers (SNPset3) explained the genetic components of the target chicken group well.
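For reference, the delta score mentioned above is the absolute difference in allele frequency between two groups at a single SNP. The following minimal sketch assumes genotypes coded as the count (0, 1, or 2) of one chosen allele per individual; the genotype values are invented and do not come from the study data.

```python
def allele_frequency(genotypes):
    """Frequency of the counted allele, given 0/1/2 genotype codes."""
    return sum(genotypes) / (2 * len(genotypes))


def delta_score(case_genotypes, control_genotypes):
    """Absolute allele-frequency difference between case and control groups."""
    return abs(allele_frequency(case_genotypes) - allele_frequency(control_genotypes))


# Hypothetical genotypes at one SNP.
case = [2, 2, 1, 2]            # 7 of 8 alleles
control = [0, 1, 0, 0, 1, 0]   # 2 of 12 alleles

delta = delta_score(case, control)
print(round(delta, 4))
```

A delta score near 1 indicates an allele that is close to fixed in one group and rare in the other, whereas a score near 0 indicates a marker with little power to separate the two groups.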
Machine learning algorithms for classification of the case and control chicken populations
Machine learning classification is a supervised learning approach for assigning new observations to classes, and it can be applied to binary or multi-class data. Machine learning can be used for voice and handwriting analysis and for document classification. In recent years, machine learning and deep learning algorithms have been used to determine phenotypic associations (e.g., in the genome, transcriptome, and methylome) in “omics” research, and to establish classification models (Pérez-Enciso & Zingaretti 2019; Alves et al. 2020).
In this study, eight machine learning classification models were used to efficiently identify target chicken populations. PCA was conducted with a machine learning algorithm to confirm whether the case and control groups could be distinguished based on the 96 markers in SNPset3. All classification algorithms showed 100% breed classification accuracy except the Naïve Bayes model (98.5%; Figure 6). The AB, RF, and DT algorithms select a subset of variables through a feature selection process. In general, machine learning models utilize this method to: 1) simplify the model for easier interpretation, 2) shorten the training time, 3) avoid the curse of dimensionality, and 4) reduce overfitting (i.e., reduce variance) (Bermingham et al. 2015). In this manner, duplicate or less relevant variables are removed, so the minimal number of SNP markers required for efficiently classifying chicken populations can be identified.
With feature selection, 36, 44, and 8 SNP markers were selected by the AB, RF, and DT models, which had classification accuracies of 99.6%, 97.9%, and 98.0%, respectively (Table 2). Thus, the target group could be classified using a small number of markers.
In the validation study including additional samples, both founder group and non-founder group chickens could be classified. The added samples included PL and CC samples from the founder population, and various samples obtained from the Korean chicken market (including new breeds not included in the 600K SNP genotyping array; e.g., WM_2, Yelim K, and HI). The discriminatory power was excellent, even when samples from a group of chickens with a completely different genetic composition to that used in the initial marker selection process were added (Figure 7). Our SNP markers were not population-specific. Therefore, adding samples could change the allele frequencies and the breed classification results. However, no significant changes were seen, and the case and control groups were classified with high accuracy (Table 3).
The AB and RF models showed similar clustering results with and without the additional verification samples, while the DT model produced relatively diffuse clusters. Overall, the accuracy was similar among the classification models. The SNP marker combinations selected by the three models can be used for classifying the target chicken population. However, the DT model showed changes in clustering with additional samples, and therefore requires further verification.
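How a tree-based model can reduce a marker panel may be illustrated with a minimal sketch. It is not the implementation used in this study: it grows a single ID3-style decision tree by information gain and reports which SNPs the tree splits on, whereas the AB and RF models derive their marker subsets from ensembles of trees. The genotype matrix, the labels (1 = case, 0 = control), and the SNP indices are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, snp_index):
    """Entropy reduction from splitting samples by genotype (0/1/2) at one SNP."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[snp_index], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def select_snps(rows, labels, candidates=None, used=None):
    """Grow an ID3-style tree and return the set of SNP indices it splits on."""
    if candidates is None:
        candidates = list(range(len(rows[0])))
    if used is None:
        used = set()
    if len(set(labels)) <= 1 or not candidates:
        return used  # pure node, or no SNP left to split on
    best = max(candidates, key=lambda i: information_gain(rows, labels, i))
    if information_gain(rows, labels, best) <= 1e-12:
        return used  # no remaining SNP separates the classes
    used.add(best)
    remaining = [i for i in candidates if i != best]
    for genotype in set(row[best] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == genotype]
        select_snps([r for r, _ in subset], [l for _, l in subset], remaining, used)
    return used


# Hypothetical panel of four SNPs: SNP 3 duplicates SNP 0, SNP 1 is uninformative.
rows = [
    [2, 0, 2, 2], [2, 1, 2, 2], [2, 2, 1, 2], [1, 0, 2, 1],   # cases
    [0, 0, 0, 0], [0, 1, 2, 0], [0, 2, 0, 0], [1, 0, 0, 1],   # controls
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

selected = select_snps(rows, labels)
print(sorted(selected))
```

In this toy panel the tree never uses the duplicated or the uninformative SNP, which is the sense in which feature selection removes redundant or less relevant variables.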