GpNet: Genomic Prediction Network Using Locally Connected Layers in Korean Native Cattle

Background: The use of DNA marker information for the prediction of genetic merit in animal and plant breeding, and of susceptibility to disease in human medicine, has become widespread. Consequently, an increasing number of methods have been proposed for more accurate and efficient genomic prediction. However, most commonly used models for genomic prediction account only for additive effects, since most of them are designed on the basis of the linear model. Results: Here, we propose GpNet, a deep learning network for genomic prediction in Korean beef cattle. With a locally connected layer, GpNet can estimate LD-block effects of single nucleotide polymorphisms (SNP), each block consisting of two or more adjacent SNPs toward the 3'-end. This operation is analogous to the way genetic information is read during gene expression, in which the sequence is interpreted in consecutive units (codons) in a fixed direction along the strand. GpNet achieved superior performance over previous state-of-the-art methods for beef carcass weight, with a predictive ability of 0.721. GpNet also found two significant quantitative trait loci (QTL) regions for carcass weight. However, GpNet showed lower performance than the linear methods for backfat thickness and eye-muscle area. Conclusions: GpNet outperformed previous state-of-the-art methods for beef carcass weight, but it could not achieve superior performance for backfat thickness and eye-muscle area. We noticed that the inability to estimate distant epistasis and dominance was the main weakness of GpNet. It therefore remains a future research issue to extend GpNet to resolve these flaws, and such further study may open a new phase of genomic prediction.


Background
The use of DNA marker information for the prediction of genetic merit in animal and plant breeding, and of susceptibility to disease in human medicine, has become widespread. This genomic information has been utilized primarily to detect regions of the genome that have an association with a specific phenotype (genome-wide association studies, GWAS) or to predict the genetic merit and phenotypes of individuals (genomic prediction) with many thousands of DNA markers, most commonly single nucleotide polymorphisms (SNP), covering the entire genome. In humans, genomic prediction has been widely used to predict disease risk and highly polygenic complex human traits [1,2]. In agriculture, genomic prediction is used to estimate genomic breeding values (gEBV), which are then used to make selection decisions in a breeding population.

Most of the commonly used models for genomic prediction have been proposed based on linear mixed models [3,4]. Genomic best linear unbiased prediction (GBLUP) uses a mixed model approach that approximates a traditional infinitesimal model and assumes all SNPs contribute a non-zero value to the genetic variance [4]. It simply uses a genomic relationship matrix built from the genotypes instead of a traditional pedigree-based relationship matrix. Bayesian linear models assume that some SNPs have zero effects, whereas others have small to moderate effects, and use posterior distributions for the parameters of the linear mixed model [3,5]. Even though these methods have shown state-of-the-art performance in many populations, they account only for additive effects, since most of them are designed based on the linear model. Thus, extended methods that account for non-linear effects, such as dominance and epistatic interactions, have been proposed recently [6,7].

Deep learning is also a good alternative for addressing this problem. Recent advances in deep neural networks have outperformed the state of the art in computer vision, natural language processing, and audio recognition tasks [8,9,10,11]. Exploiting the local information of the input features, such as the RGB channels of an image or a text or audio sequence, has accelerated the success of deep neural networks. The convolutional neural network (CNN), the most successful deep learning architecture in computer vision, applies weight-shared filter operations to adjacent regions of the input image [12]. The recurrent neural network (RNN) has been commonly used in sequence-to-sequence problems, such as speech-to-text or natural language processing, generating a new element of the sequence at a specific time step using the information from earlier in the sequence [10]. These two networks assume that regions showing similar patterns in the input data explain similar features. As shown in Fig 1(a), features in an image (e.g., hair, eye, nose, glasses) have similar RGB color patterns within the same feature. Speech sounds also share similar frequency patterns with other similar sounds (Fig 1(b)).

Interestingly, local information can also be exploited in genomic prediction. The general concept of genomic prediction relies on the linkage disequilibrium (LD) between genetic markers and the unknown quantitative trait loci (QTL). With high-density SNP panels, the markers co-segregate with the causal mutations, allowing their genetic effects to be indirectly estimated through the adjacent SNPs [3,13].
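For reference, the genomic relationship matrix that GBLUP (discussed above) builds from the genotypes is typically constructed in the usual VanRaden form. The following is a minimal sketch, not code from this study; the function and variable names are ours.

    import numpy as np

    def genomic_relationship(M):
        """Genomic relationship matrix G = ZZ' / (2 * sum(p * (1 - p))).

        M: (n_animals, n_snps) genotype matrix coded 0/1/2 (allele counts).
        """
        p = M.mean(axis=0) / 2.0     # estimated allele frequency per SNP
        Z = M - 2.0 * p              # center each column by twice its frequency
        return (Z @ Z.T) / (2.0 * np.sum(p * (1.0 - p)))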

Considering this attribute of SNP data, a genomic prediction model should estimate the effect of each LD-block, consisting of two or more locally adjacent SNPs, rather than the effect of a single SNP, for more accurate prediction. However, unlike image and sound data, LD-blocks with the same SNP pattern do not always have the same effect on individual traits. For SNP data, it is more important to recognize how biologically close each LD-block is to the unknown QTL than to recognize the SNP pattern itself.

Therefore, a different approach from previous deep learning networks, such as CNN or RNN, is required to use local information for genomic prediction. Practically, simple fully connected networks that do not use local information have usually shown better performance than local-based networks in previous studies [14]. In addition, Zingaretti et al. [15] [...]

In this study, we propose the Genomic prediction Network (GpNet), which uses a locally connected layer for genomic prediction in Korean native cattle. The locally connected layer works similarly to a causal convolution, except that the weights are unshared; that is, a different set of weights is applied to each LD-block. We validated the performance of GpNet as follows. First, GpNet was evaluated on carcass weight, backfat thickness, and eye-muscle area of Korean native cattle, and its performance was compared with GBLUP [4], BayesA [3], and BayesLASSO [18]. Second, we identified candidate QTL regions using the LD-block effects estimated by GpNet for each trait. Since there are few results in which deep learning has outperformed the linear methods, this study is an interesting attempt in the field of genomic prediction.
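To make the layer concrete, the following is a minimal sketch (ours, not the authors' released code) of one causal, locally connected operation over a SNP sequence. It assumes a TensorFlow 2.x release that still ships tf.keras.layers.LocallyConnected1D (the layer was removed in Keras 3); since that layer supports only 'valid' padding, the causal behaviour is obtained by left-padding the input.

    import tensorflow as tf

    n_snps = 1000   # hypothetical marker count
    kernel = 2      # each position combines itself with the preceding SNP

    inputs = tf.keras.Input(shape=(n_snps, 1))   # genotypes coded 0/1/2
    x = tf.keras.layers.ZeroPadding1D(padding=(kernel - 1, 0))(inputs)  # pad left only
    # Unshared weights: each LD-block position learns its own filter,
    # unlike a convolution, whose filter is shared across positions.
    x = tf.keras.layers.LocallyConnected1D(filters=1, kernel_size=kernel,
                                           padding='valid',
                                           activation='relu')(x)
    gebv = tf.keras.layers.Dense(1)(tf.keras.layers.Flatten()(x))  # one gEBV per animal
    model = tf.keras.Model(inputs, gebv)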

Model performance

[...] who maintained that a marker subset may cause the missing heritability even though the variants in the subset can explain a large proportion of the genetic variance.

To compare with the QTL mapping of GpNet, we also estimated the SNP effects using SMLMM. [...] testis [24]. Eye-muscle area (EMA) seemed to have a genetic structure similar to that of carcass weight (CWT) (Fig 2(c)).

These results seem to be due to the genetic correlation between EMA and CWT. In our data, EMA showed a correlation of 0.546 with CWT. We can see that the variant at BTA 27:23040097, which was not identified for CWT, was significant for EMA. This variant is close to DLC1, which plays a key role in the regulation of small GTP-binding proteins.

Table 3 shows the QTL regions identified by both GpNet and SMLMM. In these results, SLIT2 seemed to be a key gene for complex traits of Korean beef cattle [...] weight [25,26], body weight [27], and fertility [28].
Table 3 Predictive ability of each model with LD-pruned SNPs.

[...] where n is the number of SNPs and d is the depth of the locally connected layer (Fig 3). On the other hand, dilated convolution effectively allows the network to estimate epistasis over a much larger distance, since it requires only two shared parameters per layer (Fig 4). For example, with n = 50,000 SNPs and depth d = 3, the locally connected layers need (d + 1)n = 200,000 position-specific weights, whereas the dilated alternative needs only two weights per layer.

Dominance also contributes to the total genetic potential for the phenotype (Fig 5(a)). A nonlinear activation function, which is a critical part of the design of a neural network, can allow such networks to compute nontrivial dominance.

GpNet adopts ReLU as its nonlinear activation function. However, ReLU is not well suited to identifying dominance, since it is still linear for positive values. Instead of ReLU, a transformed sigmoid or tanh would be a good option for modelling dominance (Fig 5(b)).
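The point can be checked numerically. In the toy computation below (our illustration; the weight and bias values are arbitrary), dominance is measured as the deviation of the heterozygote response from the midpoint of the two homozygotes: when all three genotype inputs fall on ReLU's positive side, the deviation is exactly zero, while tanh yields a nonzero deviation.

    import numpy as np

    def dominance_deviation(f, w=1.0, b=0.5):
        """Heterozygote deviation from the homozygote midpoint for unit f."""
        g = np.array([0.0, 1.0, 2.0])        # genotypes aa, Aa, AA coded 0/1/2
        y = f(w * g + b)
        return y[1] - (y[0] + y[2]) / 2.0    # 0 means a purely additive response

    relu = lambda z: np.maximum(z, 0.0)
    print(dominance_deviation(relu))     # 0.0: inputs 0.5, 1.5, 2.5 are all positive
    print(dominance_deviation(np.tanh))  # ~0.18: tanh bends the heterozygote response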

In this paper, we presented GpNet, a deep learning network for genomic prediction [...]

GpNet consists of stacks of multiple locally connected layers (Fig 7). Both skip connections [11] and ReLU activation [32] are used throughout the network to enable [...]. By shrinking y with h^2, y converges to the gEBV at the end of model training.
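A hedged sketch of this stacked design follows (ours; the exact block layout, kernel sizes, and hyper-parameters in Fig 7 may differ), with d and s exposed as the scaling knobs described in the Fig 7 caption, again assuming a TensorFlow 2.x version that provides LocallyConnected1D.

    import tensorflow as tf

    def gpnet_like(n_snps, d=2, s=4):
        """Stack of s residual blocks, each a causal locally connected layer."""
        inputs = tf.keras.Input(shape=(n_snps, 1))
        x = inputs
        for _ in range(s):
            h = tf.keras.layers.ZeroPadding1D(padding=(d, 0))(x)   # causal left-pad
            h = tf.keras.layers.LocallyConnected1D(filters=1,
                                                   kernel_size=d + 1,
                                                   activation='relu')(h)
            x = tf.keras.layers.Add()([x, h])                      # skip connection [11]
        y = tf.keras.layers.Dense(1)(tf.keras.layers.Flatten()(x))
        return tf.keras.Model(inputs, y)

    model = gpnet_like(n_snps=1000)   # scale d and s per trait, as in the text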

The variance components, σ^2_G and σ^2_P, were estimated using average information restricted maximum likelihood [33], as implemented in the AIREMLF90 program [34].
Table 4 shows the variance estimation results for each trait.
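As a worked illustration of the shrinkage step above (the numbers below are hypothetical, not the Table 4 estimates): with variance components σ^2_G and σ^2_P, heritability is h^2 = σ^2_G / σ^2_P, and the centered phenotype is scaled by h^2 to give the training target.

    sigma2_G, sigma2_P = 350.0, 900.0   # hypothetical variance components (kg^2)
    h2 = sigma2_G / sigma2_P            # heritability, about 0.39
    y_centered = 25.0                   # phenotype deviation from the herd mean (kg)
    target = h2 * y_centered            # shrunken target, about 9.7 kg, ~gEBV scale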

Finding QTL with LD-block effects

In addition to predicting gEBV, GpNet can also be used to estimate the LD-block effect of each SNP. As each layer feeds forward to the next, GpNet accumulates the LD effect of the SNP at position i (x_i) as follows: [...]

[...] animals with high gEBV and low gEBV were noted (Fig 8). We hypothesized that the difference in gEBV ranking between these two groups (high and low) would be reflected in the difference in the LD-block effects of each individual. Therefore, we performed a t-test to find significant regions as follows: [...] where Sig_i is the significance value at position i, and H_xi ∈ R^1000 and L_xi ∈ R^1000 are the LD-block effects of SNP x_i in the high-gEBV group and the low-gEBV group, respectively (a sketch of this test is given after the figure captions below).

Figure 3 Visualization of the locally connected layer. n is the number of input SNPs and d is the layer depth. Since weights are unshared in a locally connected layer, the number of parameters at a d-depth layer is (d + 1)n.

Figure 4 Visualization of dilated convolution. n is the number of input SNPs and d is the layer depth. Since weights are shared in dilated convolution, the number of parameters at each layer is two.

Figure 7 GpNet architecture. LCL is the locally connected layer; d and s are the layer depth and the number of stacks; ReLU is the ReLU activation. GpNet can be scaled with d and s for different traits.

Figure 8 The process of QTL mapping.
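Below is a sketch of the per-position significance test described in the QTL-mapping section above. Our reading, an independent two-sample t-test between the two groups' LD-block effects at each SNP, is an assumption; the group size of 1,000 follows the R^1000 notation in the text, and the data here are simulated placeholders.

    import numpy as np
    from scipy.stats import ttest_ind

    def qtl_significance(H, L):
        """H, L: (1000, n_snps) LD-block effects for high- and low-gEBV animals."""
        t_stat, p_val = ttest_ind(H, L, axis=0)   # one test per SNP position i
        return -np.log10(p_val)                   # Sig_i on a Manhattan-plot scale

    rng = np.random.default_rng(0)
    H = rng.normal(0.1, 1.0, size=(1000, 500))    # toy effects, not real data
    L = rng.normal(0.0, 1.0, size=(1000, 500))
    sig = qtl_significance(H, L)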