Prediction of Indigenous Chicken Phenotypes from Genotypes: A Comparison Between Machine Learning and Conventional Linear Models

Genomic selection is a breeding strategy that is rapidly becoming the method of choice for selection. It predicts the phenotypes of quantitative traits from genome-wide marker genotypes using conventional predictive models such as ridge regression BLUP. However, these conventional models face statistical challenges related to the high dimensionality of marker data and inter- and intra-allelic interactions, and they typically make strong assumptions. Machine learning models can be used as an alternative for phenotype prediction because of their ability to address these challenges. Therefore, the aim of this study was to compare the predictive ability of a deep convolutional neural network (DeepGS), a conventional artificial neural network (ANN), the conventional statistical model ridge regression best linear unbiased prediction (RR-BLUP) and a combination of DeepGS and RR-BLUP (Ensemble model) in predicting body weight (BW) of indigenous chicken from genome-wide markers. The Pearson correlation coefficients (PCC) for the four models were 0.891, 0.889, 0.892 and 0.812 for DeepGS, RR-BLUP, Ensemble and ANN, respectively. DeepGS did not differ significantly (p > 0.05) from the other models and can therefore be used to complement the commonly used conventional models. For individuals with higher phenotypic values, the PCC of DeepGS, RR-BLUP, Ensemble and ANN dropped drastically from 0.891, 0.889, 0.892 and 0.845 to 0.315, 0.466, 0.342 and 0.518, respectively. Therefore, more effort should be directed at individuals with higher phenotypic values.

Background
Over the past decades, phenotypes have been used to select superior animals for breeding in genetic improvement programmes. However, the introduction of genomic selection (GS) as a breeding strategy has revolutionized genetic improvement programmes. Genomic selection is a breeding strategy that is rapidly becoming the method of choice for selection among animal breeders (Hayes et al., 2009). This is attributed to its ability to accurately identify genetically superior animals at a young age based on genomic estimated breeding values (GEBVs). Genomic selection predicts the phenotypes of quantitative traits from genome-wide marker genotypes using conventional predictive models. The increased accuracy and persistence of prediction at a young age based on conventional statistical models make GS more efficient than progeny testing (Calus et al., 2013), which is hindered by long generation intervals and low reproductive rates that slow genetic progress.
The success of genomic-wide selection is related to the choice of the prediction method.
Conventional models also ignore associations between predictors and target variables that may involve non-additive effects, leading to biased results and low predictive ability (Wenlong et al., 2017). By performing only linear regression analysis, these models make strong assumptions that limit their predictive performance, since they do not capture nonlinear relationships within genotypes and between genotypes and phenotypes.
Machine learning algorithms provide a new avenue for modelling noisy data and have been suggested as an attractive, non-linear alternative to traditional statistical models (LeCun et al., 2015; Singh et al., 2016; Islam, 2017). They have been shown to outperform conventional statistical models, a result attributed to their ability to model complex, non-linear relationships without having to fulfil the strict assumptions required by conventional parametric models (Favaro et al., 2014). They can also account for non-additive effects in data and automatically "learn" complex relationships from the training dataset without pre-defined rules. The focus of this study was, therefore, to compare the predictive ability of a deep convolutional neural network (DeepGS), a conventional artificial neural network (ANN) and the conventional statistical model ridge regression best linear unbiased prediction (RR-BLUP) in predicting body weight (BW) of indigenous chicken from genome-wide markers.

Results
The results in Figure 1 show the regression-based predictive ability for body weight. The absolute errors between observed and predicted phenotypic values for the four models are presented in Figure 2. DeepGS and Ensemble recorded lower p-values on the paired-sample t-test (1.192e-4 and 4.65e-4, respectively) compared with 0.0746 for RR-BLUP and 0.0506 for ANN.
Mean separation was also carried out to ascertain whether the mean body weights obtained from the four models differed significantly from the actual BW mean, using the least significant difference (LSD) at alpha = 0.05; the means for the four models were not significantly different, as presented in Table 1. One reason why the DeepGS model performed similarly to, or sometimes worse than, RR-BLUP is that allele frequencies of genotypes are always unbalanced (Bellot et al., 2018), as is, to a large extent, the coding system adopted, which typically takes a simple {0 1} or {0 1 2} format. The nature of the data, the type of trait being predicted and the network architecture also play a crucial role in determining the output of a machine learning model, because most data are noisy, which will in turn affect the outcome (Bellot et al., 2018). For instance, the PCC for the ANN in this study was 0.812, much higher than, and contrary to, the 0.378 reported by Ma et al. (2018); this may be a result of the different algorithm adopted, input dataset and data-splitting criteria. Okut et al. (2013) reported training-set correlations ranging from 0.776 to 0.858, similar to what this study found, after using different training algorithms, activation functions and numbers of neurons in the hidden layers. Therefore, there is a need for a study to explore how closely the chosen markers relate to functional regions associated with the trait being studied, since the use of unknown or non-specific markers introduces noise into the dataset (Besic et al., 2017).
The PCC results showed that combining DeepGS and RR-BLUP (Ensemble model) performed better than RR-BLUP alone, with PCCs of 0.892 and 0.889, respectively. This is because the Ensemble model optimizes its weights on the validation dataset using a particle swarm optimization (PSO) algorithm, which can search very large spaces of candidate solutions in parallel without making assumptions about the problem being optimized (Kennedy and Eberhart, 1995).

Conclusions
Even though the literature demonstrates that machine learning is a powerful tool for predicting phenotypes compared with conventional models, this study found that the margin of difference between the tested models was negligible and there was no statistical difference between the four models. The small differences observed might have arisen from the nature of the data used, the training algorithms, the network architectures and the number of traits analyzed. Therefore, the DeepGS, Ensemble, RR-BLUP and ANN models can be used interchangeably in making phenotypic predictions.

Data source
The data used for this study were obtained from a genome-wide study done by Yuan et al.

Phenotyping
Live body weight (BW) was measured at hatch and every week until 15 weeks of age.

Sample collection, DNA extraction and genotyping
Blood samples were collected from 394 birds at 15 weeks of age. DNA extraction was done using the phenol-chloroform method, and the DNA was diluted to 50 ng/ml. Genotyping was performed using Illumina 60K Chicken SNP BeadChips (Groenen et al., 2011). Quality control was conducted on all birds (after quality control of their phenotypic records) across four breeds using customized scripts in R version 3.3.0 (R Core Team, 2017).
Single-nucleotide polymorphism (SNP) filtering was done using the following criteria: samples with call rates < 0.9 and SNPs with minor allele frequency (MAF) < 0.05 were excluded. After imposing these quality control checks, a total of 46211 SNPs remained.
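The two filtering criteria above can be sketched as follows. This is an illustrative Python/NumPy sketch, not the customized R scripts used in the study; the function name and the 0/1/2 coding with NaN for missing calls are assumptions.

```python
import numpy as np

def qc_filter(geno, call_rate_min=0.9, maf_min=0.05):
    """Filter a genotype matrix (individuals x SNPs, coded 0/1/2,
    np.nan = missing) by sample call rate and SNP minor allele frequency."""
    # Sample call rate: fraction of non-missing genotypes per individual
    call_rate = 1.0 - np.mean(np.isnan(geno), axis=1)
    geno = geno[call_rate >= call_rate_min, :]
    # Frequency of the allele counted by the 0/1/2 coding, ignoring missing calls
    p = np.nanmean(geno, axis=0) / 2.0
    maf = np.minimum(p, 1.0 - p)
    return geno[:, maf >= maf_min]
```

In this sketch, samples are filtered before allele frequencies are computed, so MAF is estimated only from retained individuals.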

Genotype preprocessing and coding
The 46211 markers had some missing genotypes; these were imputed using the A.mat function of the rrBLUP package in R. Markers with 50% missing genotypes were not imputed, leaving 26698 markers for data analysis. Genotypes were coded as {0 1 2} based on an R script by Eva KF Chan (http://www.evachan.org/rscripts.html).
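The imputation step can be illustrated with a minimal sketch. This is not rrBLUP's A.mat (which offers mean and EM imputation with several options); it is a hypothetical mean-imputation analogue in Python, with the 50% missingness cutoff applied as a strict threshold by assumption.

```python
import numpy as np

def impute_markers(geno, max_missing=0.5):
    """Mean-impute missing genotypes (np.nan) per marker and drop markers
    with a missing rate at or above `max_missing` (illustrative only)."""
    miss = np.mean(np.isnan(geno), axis=0)
    geno = geno[:, miss < max_missing]      # discard heavily missing markers
    col_mean = np.nanmean(geno, axis=0)     # per-marker mean of observed calls
    idx = np.where(np.isnan(geno))
    geno[idx] = np.take(col_mean, idx[1])   # fill gaps with the marker mean
    return geno
```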

Conventional Neural Network
The conventional neural network adopted for this research was a feed-forward artificial neural network (ANN). The data were split into training, validation and test sets, with 80% used for training, 10% for validation and the remaining 10% for testing. Sampling was random in order to avoid selection bias in the dataset. Prediction (testing) was done with the remaining samples that were not used during the training and validation phases. Data preprocessing and analysis were done in R, while the construction and training of the neural networks were performed in MATLAB. A three-layer, fully-connected, feed-forward artificial neural network was built using the MATLAB function "feedforwardnet", with an 18-32-1 architecture (18 nodes in the first hidden layer, 32 nodes in the second hidden layer and one node in the output layer). Error optimization was done using the back-propagation algorithm (Rumelhart et al., 1986). Nodes in one layer were fully connected to all nodes in the next layer to form a multilayer ANN architecture.
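The random 80/10/10 split described above can be sketched as follows. This is an illustrative Python sketch, not the R/MATLAB pipeline actually used; the function name and the fixed seed are assumptions made for reproducibility of the example.

```python
import numpy as np

def split_indices(n, seed=42):
    """Randomly partition n individuals into 80% training, 10% validation
    and 10% test index sets (the seed is an illustrative choice)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                 # random order avoids selection bias
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

With the 394 genotyped birds of this study, such a split yields roughly 315 training, 39 validation and 40 test individuals.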

Conventional statistical model: Ridge Regression-Based Linear Unbiased Prediction (RR-BLUP)
Ridge Regression-Based Linear Unbiased Prediction (RR-BLUP) was adopted as the conventional statistical model. The model follows the standard linear form Y = μ + Zg + e, where Z (n × p) is the genotype matrix, Y (n × 1) the corresponding phenotype vector, n the number of individuals, p the number of markers, μ the mean of the phenotype vector Y, g (p × 1) a vector of marker effects and e (n × 1) a vector of random residual effects. The model was implemented using the mixed.solve function in R.
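A minimal sketch of the ridge-regression machinery behind this model is given below. It is not rrBLUP's mixed.solve (which estimates the ridge parameter from REML variance components); here the shrinkage parameter `lam` is a fixed illustrative value, and the n × n system is solved because p is typically much larger than n for marker data.

```python
import numpy as np

def rr_blup(Z, y, lam=1.0):
    """Ridge-regression sketch of y = mu + Z g + e: marker effects g are
    shrunk by penalty `lam` (fixed here; REML-derived in rrBLUP itself)."""
    mu = y.mean()
    n = Z.shape[0]
    # g = Z' (Z Z' + lam I)^-1 (y - mu): dual form, efficient when p >> n
    g = Z.T @ np.linalg.solve(Z @ Z.T + lam * np.eye(n), y - mu)
    return mu, g

def predict(Z, mu, g):
    """Predicted phenotypes for genotype matrix Z."""
    return mu + Z @ g
```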

Deep convolutional neural network model (DeepGS)
The DeepGS model was constructed using the deep learning technique described by Ma et al. (2018), available at GitHub (https://github.com/cma2015/DeepGS). An 8-32-1 architecture was constructed, with one input layer, one convolutional layer with 8 neurons, one sampling layer, three dropout layers, two fully-connected layers and one output layer. The input layer receives the genotypic markers of a given individual as a 1 × p matrix, where p is the number of genotypic markers. The first convolutional layer filters the input matrix with 8 kernels (weights), each 1 × 18 in size with a stride of 1 × 1, followed by a 1 × 4 max-pooling layer with a stride of 1 × 4. The output of the max-pooling layer is passed to a dropout layer with a rate of 0.2. The first fully-connected layer, with 32 neurons and a dropout rate of 0.1, is used after the dropout layer to join together the convolutional features. The rectified linear unit (ReLU) activation function is applied in the convolutional and first fully-connected layers. The output of the first fully-connected layer is then fed to the second fully-connected layer, with one neuron and a dropout rate of 0.05. Using a linear regression model, the output of the second fully-connected layer is finally connected to the output layer, which presents the predicted phenotypic value of the analyzed individual.
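The feature-map sizes implied by these layer settings can be traced with the standard (length − window) // stride + 1 rule. The helper below is an illustrative calculation, not part of the DeepGS code; "valid" convolution without padding is assumed.

```python
def conv_output_len(p, kernel=18, stride=1, pool=4, pool_stride=4):
    """Trace the feature-map length through a 1 x `kernel` convolution
    (stride `stride`) followed by 1 x `pool` max-pooling (stride
    `pool_stride`), assuming no padding."""
    after_conv = (p - kernel) // stride + 1
    after_pool = (after_conv - pool) // pool_stride + 1
    return after_conv, after_pool
```

For the 26698 markers used here, each of the 8 kernels yields a feature map of length 26681, reduced to 6670 by the max-pooling layer.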
The DeepGS model was trained on the training set and validated using ten-fold cross-validation. Parameters were optimized with the back-propagation algorithm, setting the number of epochs to 6,000, the learning rate to 0.01, the weight decay to 0.00001 and the momentum to 0.5. The loss function minimized was the mean absolute error (MAE), MAE = (1/m) Σ |predict_k − obs_k| (see Formula 1 in the Supplementary Files), where m is the number of individuals in the training dataset, and predict_k and obs_k represent the predicted and observed phenotypic values of the k-th individual, respectively. DeepGS was implemented on the central processing unit (CPU) using the deep learning framework MXNet.
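The MAE loss described above is straightforward to express directly; the short Python sketch below mirrors the formula term by term (the function name is our own).

```python
import numpy as np

def mae(obs, pred):
    """Mean absolute error: MAE = (1/m) * sum_k |pred_k - obs_k|,
    where m is the number of individuals in the training set."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.mean(np.abs(pred - obs)))
```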

Ensemble model
An ensemble GS model (Ensemble) was constructed using the ensemble learning approach by linearly combining the predictions of DeepGS (D) and RR-BLUP (R), using the formula described by Ma et al. (2018):
predict_E = (W_D × predict_D + W_R × predict_R) / (W_D + W_R)
where W_D is the weight of DeepGS, W_R the weight of RR-BLUP, predict_D the predictions of DeepGS and predict_R the predictions of RR-BLUP. For each fold of cross-validation, the parameters W_D and W_R were optimized on the corresponding validation dataset using the particle swarm optimization (PSO) algorithm.
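The combining formula itself reduces to a weighted average, as in the sketch below; finding W_D and W_R by PSO on the validation fold is outside the scope of this example, so fixed weights are passed in directly.

```python
import numpy as np

def ensemble_predict(pred_d, pred_r, w_d, w_r):
    """Weighted average of DeepGS (pred_d) and RR-BLUP (pred_r) predictions:
    predict_E = (w_d * pred_d + w_r * pred_r) / (w_d + w_r)."""
    pred_d = np.asarray(pred_d, dtype=float)
    pred_r = np.asarray(pred_r, dtype=float)
    return (w_d * pred_d + w_r * pred_r) / (w_d + w_r)
```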

Statistical analysis
For each BW trait of a single breed, boxplots were generated in R version 3.3.0 (R Core Team, 2017) to screen for outliers. The Pearson correlation coefficient (PCC) and its significance level (p < 0.05) were calculated in R for each model. The significance of the difference between paired samples was examined using Student's t-test in R.
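The two evaluation statistics can be computed from first principles, as in this illustrative Python sketch (the analysis itself was done in R; the function names here are our own).

```python
import numpy as np

def pearson_cc(obs, pred):
    """Pearson correlation coefficient between observed and predicted values."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    o, p = obs - obs.mean(), pred - pred.mean()
    return float(np.sum(o * p) / np.sqrt(np.sum(o**2) * np.sum(p**2)))

def paired_t(obs, pred):
    """Paired-sample t statistic: mean difference over its standard error."""
    d = np.asarray(obs, dtype=float) - np.asarray(pred, dtype=float)
    return float(d.mean() / (d.std(ddof=1) / np.sqrt(d.size)))
```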
Figure 2 Box plot of the absolute errors between the observed and predicted phenotypic values.

Supplementary Files
This is a list of supplementary files associated with the primary manuscript.