Residual networks without pooling layers improve the accuracy of genomic predictions

Key message: Residual neural network genomic selection (ResGS) is the first GS algorithm to reach 50 layers, and its prediction accuracy surpasses that of previous algorithms.

Abstract: With the decrease in gene-sequencing cost and the development of deep learning, the accuracy of phenotype prediction by genomic selection (GS) has continuously improved. Residual networks, a widely validated deep learning technique, are introduced here to deep learning for GS. Since each locus has a different weighted impact on the phenotype, strided convolutions are more suitable for GS problems than pooling layers. Through these technical innovations, we propose a deep learning algorithm for GS, residual neural network genomic selection (ResGS). ResGS is the first neural network in GS to reach 50 layers. In 15 prediction tasks across four public datasets, the prediction accuracy of ResGS is higher than that of ridge-regression best linear unbiased prediction, support vector regression, random forest, gradient boosting regressor, and deep neural network genomic prediction in most cases. ResGS also handles gene-environment interaction well: when phenotypes from other environments are imported into ResGS along with genetic data, the prediction results are much better than with genetic data alone, which demonstrates the effectiveness of multi-modal learning for GS. The standard deviation is recommended as an auxiliary GS evaluation metric, which could improve the distribution of predicted results. Deep learning for GS, such as ResGS, has increasingly apparent advantages over traditional GS algorithms.


Introduction
With the development of breeding technology and the advent of big data, the focus of breeding research has shifted from the phenotype to the molecular level. Genomic selection (GS) is a breeding method that estimates the genomic estimated breeding value (GEBV) based on marker information from the entire genome (Meuwissen et al. 2001). The current research methods for GS mainly include the best linear unbiased prediction (BLUP) method, the Bayes method, and machine learning. Among the BLUP models, the genomic BLUP (GBLUP) model based on genome-wide markers successfully recognizes complex patterns (VanRaden et al. 2017). In most cases, the accuracy of GBLUP is better than that of traditional BLUP methods (Misztal et al. 2017).
The genomic feature BLUP (GFBLUP) increases the number of random effects to two, which makes the model more flexible (Edwards et al. 2016). Different genetic markers are weighted differently in GFBLUP, which improves the accuracy of genomic prediction for complex traits.
Ridge regression BLUP (RRBLUP) is a widely researched and robust linear algorithm for GS (Rice et al. 2019; Endelman et al. 2011). The penalty introduced in RRBLUP is applied to all markers, tagging both small- and large-effect genomic components and preventing phenotype predictions from being dominated by a small number of markers. BLUP models generally have the advantages of stable results, short computation time, relatively simple principles, and strong interpretability. However, since BLUP models are linear, their prediction accuracy is not always competitive. To improve the accuracy of genomic prediction, Bayesian models were brought into the field of GS. The BayesA and BayesB models, subject to their prior distributions, improved the prediction accuracy of some phenotypes to 0.85 (Meuwissen et al. 2001). Models such as Bayesian LASSO, BayesC, BayesCπ, and BayesDπ continue to improve accuracy and speed (Park et al. 2008; Lu et al. 2020; Habier et al. 2011). BayesA is a good choice when dealing with real data, although it has the longest computation time; from the perspective of computing time, BayesCπ is better than BayesDπ. The performance of Bayesian algorithms often depends on how well their prior assumptions match the actual situation.
Different phenotypes require different prior distributions.However, determining an appropriate prior distribution is not an easy task.
Machine learning is a method of data analysis that automates analytical model building.
Currently, the machine learning methods supporting GS include the random forest (RF) model, the support vector machine (SVM) model, stochastic gradient boosting, and so on (Ogutu et al. 2011). RF regression fits regression trees whose nodes are rooted in bootstrapped data. In BreedWheat Genomic Selection pipeline (BWGS), an R package for genomic selection, RF gives the highest predictive accuracy of 0.543 among 14 methods (Charmet et al. 2020). RF can capture non-additive effects, which makes it seem promising (Heslot et al. 2012). Support vector regression (SVR) is an algorithm with clear mathematical principles; it aims to discover a hyperplane in an n-dimensional space, where n denotes the number of features or independent variables (Kavitha et al. 2016).
The highest accuracy of GS for alfalfa biomass yield in different reference populations was obtained by SVR (Annicchiarico et al. 2015). SVM, GBLUP, and BayesR have been compared for genomic prediction in pig and maize populations (Zhao et al. 2020); SVM is a competitive method in GS, although it is not significantly better than the other two algorithms. The gradient boosting regressor (GBR) is an ensemble learner for regression that produces results based on a collection of individual models (Cai et al. 2020). In a bull dataset, GBR obtained the best prediction correlation coefficient among GBR, BayesB, GBLUP, RF, CNN, and the multilayer perceptron (MLP) (Abdollahi-Arpanahi et al. 2020). Machine learning algorithms can improve computational efficiency and provide higher prediction accuracy than traditional GS models.
For the "big p small n" problem (n represents the number of samples, and p is the number of SNPs), the machine learning optimization algorithm is used to solve the problem, and the calculation efficiency of the whole process is very high (Fitzpatrick et al. 2021).
In recent years, deep learning (DL) has achieved great success in images, natural language processing, and content generation (Tian et al. 2020; Otter et al. 2020; Summerville et al. 2018).
In the field of GS, deep learning genomic selection (DeepGS), based on DL, is an early attempt to predict biological phenotypes (Ma et al. 2017). Compared with RRBLUP, DeepGS yields a relative improvement ranging from 1.44% to 65.24%. Deep neural network genomic prediction (DNNGP) is superior to most GS methods (Wang et al. 2023). Some guides on DL for complex-trait genomic prediction have appeared in the literature (Zou et al. 2019; Pérez-Enciso et al. 2019). First, compared to multilayer perceptron networks (MLP) and recurrent neural networks (RNN), convolutional neural networks (CNNs) appear to be the most promising predictive tool. Second, we cannot expect significant improvements with DL in this field.
Third, fewer than five layers are adopted in most genomics applications (Zhou et al. 2015; Kelley et al. 2016). In most of the literature, max-pooling layers are consistently utilized for compressing information. Maximum pooling, or max pooling, is a pooling operation that takes the maximum value in each patch of each feature map; it is very effective for computer vision tasks such as image classification.
In this paper, we introduce residual structures and strided convolutions to the field of GS, which significantly improves on existing GS deep learning models. Residual connections are one of the most successful network structures in modern DL (Szegedy et al. 2017; Xie et al. 2017; Luo et al. 2020). A key finding is that pooling layers are a poor way to compress information in the GS domain; pooling layers are a main reason why DL has not worked better in the GS field. In a strided convolution, the kernel moves across the input gene sequence with a specific step size, or 'stride', which determines the output size; the strided convolution thus implements compression through learned gene weights while reducing computational complexity. We propose a hybrid algorithm, residual neural network genomic selection (ResGS), for genomic prediction. The most significant difference between ResGS and previous DL algorithms in the GS field is the introduction of residual structures.
The residual structure deepens the neural network and improves the model's predictive ability at the same time. Combining the above innovations, ResGS is the first 50-layer neural network model built in the GS field. Multi-modal learning is also applied to gene-environment interactions and achieves good prediction results.

Plant materials
To test the prediction performance of ResGS on different phenotypes, we performed genomic prediction on four public datasets. The maize301 dataset is the built-in data of the TASSEL software (Bradbury et al. 2007). It is a medium-sized dataset with 301 maize plants and 3093 markers, and it contains three phenotypic traits: days to pollination (Dpoll), ear diameter (EarDia), and ear height (EarHT). Due to missing phenotypes in some plants, the numbers of maize plants with Dpoll, EarDia, and EarHT are 276, 249, and 279, respectively. The rice395 dataset comes from the literature (Zhao et al. 2010), with 395 diverse accessions of Asian rice belonging to five subpopulations (indica, aus, tropical japonica, temperate japonica, and Group V). There are 1536 SNPs in rice395, 1311 of which have high-quality scores. This dataset records two complex traits: amylose content (AC) and seed length (SL).
The rice413 dataset has 44,100 SNP markers and 34 traits (Zhao et al. 2011). Its 413 diverse accessions of Asian rice come from 82 countries. The p-values for the 34 phenotypes of rice413, computed with a modified mixed model, are given in the literature, and these results are open source. In this paper, genomic prediction is performed for alkali spreading value (ASV), amylose content (AC), panicle number per plant (PNPP), protein content (PC), seed length (SL), and seed number per panicle (SNPP) in rice413. AC and SL are predicted in both rice395 and rice413. Compared with rice395, rice413 contains more SNP sites and more phenotypes; rice395 and rice413 serve as the small and large datasets, respectively.
The wheat599 dataset is the CIMMYT wheat public dataset built into the BGLR software (Pouladi et al. 2015). It contains 599 wheat samples with 1279 markers and records one trait in four different environments, namely env1, env2, env3, and env4.

Genetic data encoding and preprocessing
The genotypes come from VCF-format files; VCF files contain genotype information for each sample at each position. For a marker with two alleles, A and a, the genotype score is defined as 2, 1, and 0 for the AA, Aa, and aa genotypes, respectively. Wheat is allohexaploid, and its genotype scores are assigned only 1 and 0. Missing genotypes are coded as -1. The phenotypes involved in this paper are quantitative traits, and both genotype and phenotype data are stored as floating-point numbers in the program. This paper does not perform principal component analysis (PCA) on the genetic data, because deep learning encourages end-to-end learning on raw feature data. A random seed is used when dividing the data into training and testing sets to ensure reproducible results.
Data preprocessing and phenotype prediction are performed in Python, using the scikit-learn machine learning library and the TensorFlow library. ResGS is implemented in Keras, a TensorFlow high-level API.
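As a minimal illustration of the encoding just described (the genotype strings are assumed to be already parsed out of the VCF; the helper name and the example values are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical encoding of parsed VCF genotype calls; anything unrecognized
# (e.g. a missing call such as "./.") is coded as -1, as described above.
GENOTYPE_SCORES = {"AA": 2.0, "Aa": 1.0, "aa": 0.0}

def encode_genotypes(calls):
    """calls: 2-D array-like of genotype strings, samples x markers."""
    encode = np.vectorize(lambda g: GENOTYPE_SCORES.get(g, -1.0))
    return encode(np.asarray(calls)).astype("float32")

# Toy example: two samples, three markers.
X = encode_genotypes([["AA", "Aa", "./."],
                      ["aa", "AA", "Aa"]])
y = np.array([1.7, 0.9], dtype="float32")  # quantitative phenotype

# Fixed random seed so the train/test split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)
```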

The General principle of ResGS
The sample size is critical to model selection (Vabalas et al. 2019; van Smeden et al. 2019; Portet et al. 2020). There is no single optimal algorithm when the sample size is less than a thousand (Dosovitskiy et al. 2020; Alwosheel et al. 2018; Liu et al. 2017). As the number of samples gradually increases, small neural networks begin to outperform traditional machine learning algorithms (Fig. 1A). Large neural networks tend to perform best when the amount of data is large enough, which has led to the popularity of big models (Chen et al. 2020; So et al. 2019).
Unfortunately, GS problems generally sit on the left side of Fig. 1A: in a typical GS problem, the number of samples is in the hundreds, or even only a few dozen. Previous studies have found that different phenotypes require different algorithms for prediction (Abdollahi-Arpanahi et al. 2020; Liu et al. 2020).
The calculation process of ResGS is shown in Fig. 1B. First, biological phenotypes are predicted by four traditional machine learning algorithms: RRBLUP, SVR, RF, and GBR.
RRBLUP and SVR are parametric and geometric methods, respectively, while RF and GBR are two ensemble learning methods. These four algorithms are quite different, and each performs well on specific phenotype predictions; no single algorithm predicts best across all phenotypes. In ResGS, the algorithm that best predicts a particular phenotype is selected from these four. Unlike previous DL models that predict phenotypes directly, ResGS predicts phenotypic residuals (Fig. 1B). The phenotypic residual is the difference between the observed phenotype and the phenotypic value predicted by the best traditional machine learning method. Predicting phenotypic residuals is a more straightforward task than predicting phenotypes, because the range of the residuals is often significantly smaller than the range of the phenotype; predicting residuals with DL is a way of standing on the shoulders of the traditional methods. As long as the DL stage is not counterproductive, ResGS improves predictions compared to the traditional methods.
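The two-stage idea can be sketched as follows, with scikit-learn's Ridge standing in for RRBLUP and `resgs` denoting the network described later; this is an illustrative outline, not the paper's exact pipeline:

```python
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Candidate traditional models; Ridge stands in for RRBLUP here.
candidates = {
    "RRBLUP": Ridge(alpha=1.0),
    "SVR": SVR(),
    "RF": RandomForestRegressor(random_state=42),
    "GBR": GradientBoostingRegressor(random_state=42),
}

def best_traditional_model(X, y):
    # Pick the candidate with the highest cross-validated score on the training set.
    scores = {name: cross_val_score(m, X, y, cv=5).mean()
              for name, m in candidates.items()}
    best = max(scores, key=scores.get)
    return candidates[best].fit(X, y)

# Stage 1: traditional prediction; Stage 2: the deep network learns the residual.
# base = best_traditional_model(X_train, y_train)
# residual_target = y_train - base.predict(X_train)      # what ResGS is trained on
# y_pred = base.predict(X_test) + resgs.predict(X_test)  # final predicted phenotype
```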
In ResGS, an elementary and powerful residual block is the basis of the model (Fig. 1C).
There are two branches in the residual block. One branch passes through a convolutional layer with a stride of one, a ReLU layer, and a batch normalization layer. The other branch is called a shortcut, which keeps the input unchanged. The output is the sum of the two branches. All of these operations are standard and have corresponding formulations in the literature (Wang et al. 2020; Long et al. 2019; Amin et al. 2020). In practice, residual connections have proven to be a very effective technique for improving neural networks (Liu et al. 2019; Li et al. 2019). Before the advent of residual blocks, only about twenty layers of a neural network could be trained; after residual connections appeared, thousand-layer neural networks became trainable (He et al. 2016).
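A sketch of this residual block in Keras, following the description above; the input is assumed to already carry `filters` channels so that the two branches can be summed:

```python
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size=3):
    # Branch 1: convolution (stride 1) -> ReLU -> batch normalization.
    y = layers.Conv1D(filters, kernel_size, strides=1, padding="same")(x)
    y = layers.Activation("relu")(y)
    y = layers.BatchNormalization()(y)
    # Branch 2 (shortcut): the input passes through unchanged.
    return layers.Add()([x, y])
```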
Introducing the residual structure significantly alleviates the vanishing-gradient problem, which is the most important reason why it works so well. The calculations later in this paper show that the residual block is also effective in GS. In the previous literature, DL for GS performed information compression through max-pooling, so only the most significant marker contributes (Fig. 2A). In the image field, max-pooling has proven to be an efficient information compression technique; researchers therefore introduced max-pooling to GS without verification. Modern DL typically compresses information through max-pooling, average-pooling, or strided convolution, and the three compression methods correspond to three situations. In max-pooling, only one gene affects several adjacent genes; in average-pooling, adjacent genes play the same role. In a CNN, the convolution kernel is the operator of the convolution operation; in GS, the gene encoding vector and the convolution kernel perform a vector inner product. The amount by which the kernel moves across the input vector is called the "stride", with a default value of 1. Ignoring edge effects, when the stride is 1, gene information is not compressed; when the stride is n with n greater than 1, the genetic information is compressed to 1/n of its previous length. The information compression ratio of max-pooling, average-pooling, and convolution with a stride of two is therefore 2: starting from six positions of information (Fig. 2A), each of the three compression methods yields three positions. Strided convolution compresses information according to weights. Which information compression method is most suitable for GS?
The p-value obtained by genome-wide association studies (GWAS) is usually taken as the weight of a single-nucleotide polymorphism (SNP) (Saini et al. 2020; Luo et al. 2021). The most commonly accepted threshold for classifying an association as significant is p < 5 × 10⁻⁸. Each SNP can be assigned a p-value; here we directly quote the rice GWAS results from the literature (Zhao et al. 2011). Two conclusions can then be drawn (Fig. 2B). First, the weights of most markers are not zero, which indicates that max-pooling is not a suitable compression method. Second, the weights of adjacent markers are not similar, which shows that average-pooling is not a reasonable compression method either. Therefore, strided convolution, which compresses information according to learned weights, is the most suitable of the three for GS.
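The three compression methods can be compared directly in TensorFlow. In this toy example with six positions and one channel (mirroring Fig. 2A), each method maps six positions to three; the convolution weights are randomly initialized here, standing in for learned weights:

```python
import tensorflow as tf

# One sample, six positions, one channel -- as in Fig. 2A.
g = tf.constant([[[2.0], [1.0], [0.0], [2.0], [0.0], [1.0]]])

max_out = tf.keras.layers.MaxPooling1D(pool_size=2)(g)      # keeps only the largest value per pair
avg_out = tf.keras.layers.AveragePooling1D(pool_size=2)(g)  # weights adjacent positions equally
conv_out = tf.keras.layers.Conv1D(filters=1, kernel_size=2,
                                  strides=2)(g)             # per-position learned weights

print(max_out.shape, avg_out.shape, conv_out.shape)  # all (1, 3, 1): 6 positions -> 3
```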

The architectural details of ResGS
The prediction algorithms in this paper include RRBLUP, SVR, RF, GBR, DNNGP, and ResGS. The scikit-learn library implements RRBLUP, SVR, RF, and GBR.
DNNGP is implemented according to the open-source code in the literature (Wang et al. 2023).
ResGS is composed of several basic modules. The residual block, shown in Fig. 1C, is its most critical module. Another essential module is the strided convolution block (Fig. 3A), which consists of three layers: a convolutional layer, a ReLU layer, and batch normalization. The strided convolution block comes in two types, with a stride of 1 or a stride of 2. The primary purpose of the convolution block with a stride of 2 is to compress information and reduce the length of the gene representation. To minimize information loss during this compression, the number of channels is expanded four-fold in the convolution block with a stride of 2. To avoid drastic inflation of the model parameters, the model then reduces the number of channels back to a normal level, a requirement met by the convolution block with a stride of 1. These deep learning operations are often used on images (Yepez et al. 2020); the difference is that strides on images are two-dimensional (e.g., 1×1), whereas strides in ResGS are one-dimensional.
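A sketch of the strided convolution block, reusing the Keras `layers` module; the kernel size is an illustrative assumption:

```python
from tensorflow.keras import layers

def conv_block(x, filters, strides, kernel_size=3):
    # stride 2 halves the sequence length; stride 1 keeps it unchanged.
    x = layers.Conv1D(filters, kernel_size, strides=strides, padding="same")(x)
    x = layers.Activation("relu")(x)
    x = layers.BatchNormalization()(x)
    return x
```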
The residual unit consists of a convolution block with a stride of 2, a convolution block with a stride of 1, and two residual blocks in sequence (Fig. 3B). The convolution block with a stride of 2 reduces the information length; the convolution block with a stride of 1 reduces the number of channels. The genetic information gradually approaches the phenotype through the two residual blocks. In the previous literature, deep learning for GS relied mainly on max-pooling and convolution modules; in ResGS, strided convolutions replace the pooling layers and residual blocks replace the plain convolutional layers.
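Combining the blocks above gives a sketch of the residual unit; the function names reuse the earlier sketches, and the exact channel counts are illustrative:

```python
def residual_unit(x, filters):
    # Compress: halve the length while expanding channels 4x to limit information loss.
    x = conv_block(x, filters * 4, strides=2)
    # Restore the channel count to a normal level.
    x = conv_block(x, filters, strides=1)
    # Refine the representation with two residual blocks.
    x = residual_block(x, filters)
    x = residual_block(x, filters)
    return x
```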
Compared with the previous two modules, the strided convolution and the residual block significantly reduce the loss of genetic information at each layer. This conclusion has been widely confirmed in deep learning research in other fields (Zaniolo et al. 2020), and the genomic predictions for maize, rice, and wheat in this paper show that it also applies to GS. In the literature, deep learning in the GS field has trained networks of fewer than 10 layers, whereas ResGS has as many as 50 layers. In deep learning, deeper models generally achieve higher prediction accuracy; accordingly, ResGS surpasses the deep learning methods in the literature in phenotype prediction accuracy.
The first half of ResGS is a sequence of convolution modules, and the second half is a stack of fully connected layers (Fig. 3C). The SNP information of the plant is encoded as the input of the network. In the prediction process, the traditional machine learning model and ResGS each predict a value, and the sum of these two values is taken as the predicted phenotype.

Prediction accuracy of ResGS in different phenotypes
We analyze the phenotype prediction performance of ResGS on maize301, rice395, and rice413 (Fig. 4). At the same time, RRBLUP, SVR, RF, GBR, and DNNGP are introduced to predict these phenotypes so that the prediction accuracy of the different algorithms can be compared. To evaluate the algorithms' prediction accuracy stably, 10-fold cross-validation is applied throughout this paper.
The Pearson correlation coefficient between the predicted and observed phenotypes is taken as the prediction accuracy.
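A sketch of this evaluation protocol, where `make_model` builds a fresh instance of whichever method is being evaluated (the fixed seed is an assumption for reproducibility):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import KFold

def cv_pearson(make_model, X, y, n_splits=10, seed=42):
    accs = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True,
                                     random_state=seed).split(X):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        # Prediction accuracy = Pearson r between predicted and observed phenotypes.
        r, _ = pearsonr(model.predict(X[test_idx]), y[test_idx])
        accs.append(r)
    return float(np.mean(accs))
```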
In maize301, RRBLUP outperforms the three other traditional machine learning algorithms (SVR, RF, and GBR). The two neural network algorithms (DNNGP and ResGS) are clearly better than the previous four algorithms (Fig. 4A). The prediction accuracies of DNNGP for the three phenotypes Dpoll, EarDia, and EarHT are 0.78, 0.62, and 0.57, respectively; those of ResGS are 0.78, 0.65, and 0.56. DNNGP is a relatively competitive algorithm, and in maize301, ResGS and DNNGP predict phenotypes with similar accuracy.
Although rice395 is a small dataset, the phenotype prediction accuracy is not low (Fig. 4B). The prediction accuracies of RRBLUP, SVR, RF, GBR, DNNGP, and ResGS for AC are 0.76, 0.72, 0.89, 0.88, 0.85, and 0.91, respectively; here the predictions of RF and ResGS are the most accurate. When the prediction accuracy exceeds 0.9, the predicted and observed phenotypes almost always follow the same trend. SL is a shape phenotype controlled by multiple genes, yet the prediction accuracy of ResGS for SL still reaches 0.84. For both AC and SL, ResGS is the best algorithm for genomic prediction.
The rice413 is a relatively large dataset, and we performed genomic prediction for six phenotypes (Fig. 4C). ResGS ranked first in all six phenotype predictions. DNNGP took second place four times, third place once, and fifth place once. In most phenotype predictions, RRBLUP, SVR, RF, and GBR ranked between third and sixth, and the predictive abilities of these four algorithms are relatively close. ResGS first makes predictions with RRBLUP, SVR, RF, and GBR and then predicts the phenotypic residuals, so it naturally outperforms RRBLUP, SVR, RF, and GBR on all phenotype predictions. Considering that DNNGP requires less computation than ResGS, the performance of DNNGP is quite good; from the perspective of prediction accuracy, however, ResGS still performs better.
The prediction results for the 11 phenotypes in the three datasets were aggregated. The average prediction accuracies of RRBLUP, SVR, RF, GBR, DNNGP, and ResGS are 0.63, 0.61, 0.66, 0.65, 0.71, and 0.75, respectively; SVR performs worst and ResGS performs best.
Another interesting finding is that the algorithms on rice395 outperform those on rice413 when predicting both AC and SL. On rice395 and rice413, the prediction accuracies of ResGS for AC are 0.91 and 0.87, respectively, and those for SL are 0.84 and 0.82. The two datasets share accessions of Asian rice and were measured by the same authors, and rice413 contains more rice varieties and more markers than rice395. However, more markers did not lead to higher prediction accuracy. The main reason is the higher quality of the markers in rice395; the additional markers in rice413 aggravate the "big p small n" problem. For GS, improving marker quality is more important than increasing marker quantity.

Performance of ResGS in different environments
Organisms exhibit different phenotypes in different environments, even when the genotypes are identical, so adaptability across environments is a critical evaluation criterion for a GS algorithm. We examine the performance of RRBLUP, SVR, RF, GBR, DNNGP, and ResGS in different environments on the wheat599 dataset.
In the first scheme, the sample data comprise feature X and target Y, where feature X is the gene sequence and target Y is the phenotype in one environment. This scheme is the one usually adopted in GS, as shown in the upper part of Fig. 5A. The phenotype prediction results of the six GS algorithms are displayed as radar charts (Fig. 5B), where an algorithm on an outer ring has higher prediction accuracy than one on an inner ring. RRBLUP is the worst algorithm in scheme 1, with an average prediction accuracy across the four environments of only 0.32. DNNGP and ResGS are the two top-performing algorithms: ResGS outperforms DNNGP in env1, env3, and env4, while DNNGP overtakes ResGS in env2.
Across the four environments, the average prediction accuracies of DNNGP and ResGS are 0.60 and 0.61, respectively, so ResGS is slightly better than DNNGP.
For gene-environment interactions, there is another GS calculation scheme (the lower part of Fig. 5A). The target Y is still the phenotype in one environment, but feature X comprises phenotypic traits from the other environments together with the genomic data. The wheat599 dataset contains the phenotypes of the same wheat genotypes in four different environments.
Here we assume that the phenotypes of the wheat in three environments are known and predict the phenotypes in the remaining environment. Conventional normalization methods, such as min-max normalization, cannot be applied here, because they treat every feature as having similar status; since the number of gene features far exceeds the number of environments, the weight of the phenotypes from other environments must be increased. Because phenotypic and genetic data have different scales, the phenotypic data in feature X are multiplied by a factor. This factor is a hyperparameter whose appropriate value is obtained by grid search. Based on wheat599, the phenotype in env1 is predicted from the env2, env3, and env4 phenotypes plus the genomic data; a grid search over 1, 5, 6, 7, 8, 10, and 100 determines the factor to be 7.
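A sketch of this feature construction and grid search, with Ridge standing in for the downstream predictor; `G`, `P_other`, and `y` denote the genotype matrix, the phenotypes from the other three environments, and the target phenotype:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def build_features(G, P_other, factor):
    # Phenotypes from the other environments are up-weighted by a scalar factor
    # before being concatenated with the (much wider) genotype matrix.
    return np.hstack([G, factor * P_other])

def search_factor(G, P_other, y, grid=(1, 5, 6, 7, 8, 10, 100)):
    scores = {f: cross_val_score(Ridge(), build_features(G, P_other, f), y,
                                 cv=5).mean()
              for f in grid}
    return max(scores, key=scores.get)  # the paper settles on 7 for env1
```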
The prediction accuracy of the different GS algorithms with the new feature X is shown in Fig. 5C. In the radar chart, RRBLUP does not interleave with the other algorithms and is clearly the worst predictor. The lines of RF, SVR, and GBR almost coincide, which shows that the prediction accuracies of these three algorithms are close; only in env4 is SVR significantly better than RF and GBR, with accuracies of 0.65, 0.62, and 0.59, respectively. The ResGS line lies on the outermost circle and the DNNGP line on the second outermost circle. The prediction accuracies of ResGS in env1, env2, env3, and env4 are 0.77, 0.82, 0.79, and 0.73, respectively. Across the two calculation schemes, ResGS is the algorithm with the highest prediction accuracy in most environments.
We compare the prediction accuracy of the different algorithms under the two calculation schemes (Fig. 5D). The phenotype in env1 is predicted from gene data alone and from "gene + phenotype" input, respectively.

The improvement of ResGS compared to the traditional machine learning models
ResGS predicts the residuals of the phenotypes predicted by traditional machine learning algorithms. However, the predictions of traditional machine learning algorithms contain errors, so ResGS does not automatically achieve higher prediction accuracy than the traditional algorithms. The prediction results for Dpoll in maize301, SL in rice395, and PC in rice413 are singled out for detailed analysis (Fig. 6).
RRBLUP is the best traditional machine learning algorithm for predicting Dpoll (Fig. 4A). In Fig. 6A, the red points are the prediction results of RRBLUP, and the blue triangles are the results of ResGS; the black straight line represents perfect prediction. The test sample size is 27. ResGS is equivalent to fine-tuning the traditional machine learning predictions, so the red points and the blue triangles are not far apart. Linear fits of the phenotypes predicted by traditional machine learning and by ResGS give the equations y = 0.107x - 60.1 and y = 0.157x - 55.4, respectively. The slope of the line fitted for ResGS is larger, which means that the range of phenotypes predicted by ResGS is wider. The Pearson correlation coefficients of traditional machine learning and ResGS are 0.695 and 0.801, respectively; ResGS improved on the traditional machine learning predictions, increasing the prediction accuracy by 0.106 in this case. Since the accuracy in Fig. 4A is the average over 10-fold cross-validation, the results here differ slightly from those in Fig. 4. In this case, the phenotypes predicted by ResGS are all smaller than those predicted by traditional machine learning; this kind of bias happens occasionally in ResGS. Is ResGS good at predicting Dpoll? From the perspective of the Pearson correlation coefficient, ResGS's prediction accuracy of 0.81 is quite good. However, neither traditional machine learning nor ResGS is good enough at predicting the distribution of the phenotypes (Fig. 6B). The minimum, 25% quantile, median, 75% quantile, and maximum of the observed Dpoll are 58.5, 62.75, 67.5, 70.75, and 81.0, respectively; the observed Dpoll follows a flat normal distribution, whereas the phenotypes predicted by traditional machine learning and by ResGS follow a spiky normal distribution. The minimum and maximum values predicted by traditional machine learning are 65.63 and 69.37, and those predicted by ResGS are 64.08 and 68.20; both ranges are much smaller than the observed Dpoll range. As long as the phenotype prediction trend is correct, the Pearson correlation coefficient will be close to 1. Based on this fact, a prediction model can adopt the following "lazy" strategy: it conservatively predicts phenotypes only around the phenotype average, adding a small value so that the predicted and observed phenotypes stay on the same trend. With this strategy, the prediction accuracy is close to 1, yet the predicted and observed phenotypes remain quite different. This "lazy" behaviour of predictive models is often observed in our experiments; it also occurs with GS models in the literature (Wang et al. 2023), although those authors seldom analyze it.
Most phenotypes follow a normal distribution, which means that the number of samples near the phenotype average is large and the number of samples far from it is small. GS models therefore tend to predict values near the phenotypic mean when trained with a mean squared error (MSE) loss function; this phenomenon is very similar to sample imbalance in classification problems. Even if the Pearson correlation coefficient is close to 1, the GS model is not guaranteed to be good. Two indicators are therefore recommended for evaluating a GS model: the Pearson correlation coefficient and the standard deviation. The standard deviation of the predicted phenotypes should be as close as possible to that of the observed phenotypes. For Dpoll, the standard deviations of the observed, traditional machine learning-predicted, and ResGS-predicted phenotypes are 5.88, 0.909, and 1.15, respectively.
From the standard deviation perspective, both the traditional machine learning model and ResGS perform poorly. The standard deviation of the ResGS-predicted phenotypes is larger than that of the traditional machine learning predictions, so ResGS makes the distribution of prediction results wider. How to bring the standard deviation of the predicted results closer to that of the observed phenotypes is left for future research.
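A sketch of reporting the two recommended indicators together:

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate_gs(y_obs, y_pred):
    r, _ = pearsonr(y_obs, y_pred)
    return {
        "pearson_r": float(r),                   # primary metric: trend agreement
        "std_observed": float(np.std(y_obs)),    # the two standard deviations
        "std_predicted": float(np.std(y_pred)),  # should be as close as possible
    }
```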
The prediction accuracies of the traditional machine learning model and ResGS for SL are 0.829 and 0.871, respectively (Fig. 6C). PC is the least predictable of the three phenotypes: the prediction accuracy of the traditional machine learning model is only 0.576 (Fig. 6E).

Effect of training set size on ResGS
The number of available phenotypic records in GS is also called the sample size, and it is the key factor restricting the prediction accuracy of the different GS algorithms. The sample size is generally between hundreds and thousands, while the number of SNP variants can range from thousands to tens of millions; the number of SNP variants is thus usually more than sufficient.

Discussion
With the development of DNA sequencing technology, our understanding of complex traits is gradually increasing. The main methods for predicting complex traits include linear, Bayesian, and machine learning methods, but these generally suffer from defects such as low prediction accuracy and unstable performance across species and phenotypes. Deep learning is a high-accuracy approach that requires a large number of samples; in GS, the sample size is generally only a few hundred, which makes deep learning extremely prone to overfitting. We propose ResGS, an algorithm that combines traditional machine learning and deep learning, which reduces the dependence on the amount of data and improves the accuracy of phenotype prediction.

The network architecture for genomic prediction
In the literature (Abdollahi-Arpanahi et al. 2020; Zou et al. 2019; Pérez-Enciso et al. 2019), MLP, CNN, and RNN for GS are compared in detail, and the conclusion is that CNN is the best network architecture of the three. However, some CNN computing techniques, such as max pooling and batch normalization, were introduced to GS without careful verification, and deep learning for GS is still at an early stage of application, with several defects. First, translation invariance and equivariance are default assumptions in CNNs; images conform well to translation invariance, but SNP sequences do not. We conjecture that CNNs perform relatively well in GS because they have many channels, each of which captures part of the information in the genome, so the number of channels is significant in deep learning for GS. Even so, CNN is not an ideally suited network architecture for GS: in the previous literature (Zhou et al. 2015; Kelley et al. 2016), CNN was not much better than traditional machine learning when predicting complex phenotypes. A neural network architecture designed specifically for GS is needed, one able to identify linkage disequilibrium (LD) and calculate SNP weights. Second, many neural network techniques verified in other fields have not yet been introduced into GS. For example, the Transformer, based on a self-attention mechanism, is the most successful newly designed network of recent years (Vaswani et al. 2017) and has achieved state-of-the-art results in natural language processing, images, speech, and more (Dosovitskiy et al. 2020; So et al. 2019); so far, applications of the Transformer in GS have rarely been reported. Residual connections are another of the most successful ideas in deep learning. After the introduction of residual connections, neural networks grew from about 20 layers to more than 200 layers, and the prediction accuracy of deep learning rose to a higher level. Apart from this paper, we have not seen residual connections applied in GS.
Third, many settings from image CNNs are arbitrarily carried over to CNNs in GS. Max pooling helps remove the effect of small image translations, so it has achieved great success on images. However, max pooling also risks discarding secondary features outright: we should not attend to one marker gene while completely ignoring its neighbours. Other settings, such as the kernel size, the choice of activation function, and the number of channels, all directly follow the conventions of image CNNs.
We make two main successful modifications to deep learning for GS: residual connections and strided convolutions. A residual connection provides another path for data to reach later parts of the neural network by skipping some layers (Fig. 1C); it alleviates gradient vanishing and thus improves GS prediction accuracy. For information compression, strided convolution loses less information than pooling layers, because it compresses according to learned weights and thus preserves the genetic information to the greatest extent. With residual connections and strided convolutions, we successfully train ResGS with up to 50 layers, and its prediction accuracy exceeds that of the other algorithms. If more advanced deep learning architectures are introduced, the prediction accuracy of deep learning for GS can be improved further.

Comprehensive performance of ResGS
The relationship between genes and phenotypes is complex due to additive and non-additive gene effects, and humans still do not fully understand how genes determine phenotypes. Genes can affect phenotypes in many ways, so no single GS algorithm can be guaranteed to predict every phenotype. Each GS algorithm represents one mapping from genes to phenotypes, and different phenotypes have different gene-mapping patterns.
From the perspective of computational theory, no algorithm can outperform all others when only a few hundred samples are available. In machine learning, each sample is a learning opportunity for the algorithm, and the learning results are stored in its architecture and parameters. With only a few hundred samples, a large neural network with a novel architecture cannot converge all of its parameters to appropriate values. Different algorithms correspond to different gene-to-phenotype mapping modes.
Appropriate algorithms can reduce dependence on sample size.DNNGP is an excellent GS algorithm, but RF is better when predicting ASV (Fig. 4C).
ResGS combines RRBLUP, SVR, RF, GBR, and neural networks to overcome the lack of samples. First, it finds the algorithm most suitable for the current phenotype among RRBLUP, SVR, RF, and GBR, and this optimal algorithm predicts the phenotype. ResGS then predicts the prediction error of that algorithm, also known as the residual.
Among the six algorithms compared in this paper, ResGS is the most stable, and in most cases it is also the algorithm with the highest prediction accuracy (Figs. 4 and 5). In GS, one can either analyze each phenotype case by case to select an algorithm or, preferably, adopt a hybrid algorithm such as ResGS.

Multi-modal learning for GS
Genetic data are stitched together with phenotypic data from other environments when predicting phenotypes across environments (Fig. 5A). Phenotypic and genetic data are clearly two completely different kinds of data, and we can call them two modalities. This paper introduces multi-modal learning to GS, and it performs much better than single-modal learning (Fig. 5D).
Multi-modal learning involves many different kinds of input at once. An example of multi-modal data is data that combines text with visual and auditory information, which is far richer than text alone (Srivastava et al. 2012). Multi-modal learning is a hot topic in deep learning and has succeeded notably in large models (Zhang et al. 2018; Zhang et al. 2019). Because it receives input from multiple modalities, multi-modal learning obtains much richer information, and a multi-modal model easily outperforms a unimodal one; such models have gradually acquired capabilities in summarization, sentiment analysis, question answering, and machine translation. As far as we know, multi-modal learning has not yet been applied in GS.
GS in multiple environments has always been a research hotspot, because biological phenotypes are determined mainly by genes and environments. Gene data are sequences of adenine (A), cytosine (C), guanine (G), and thymine (T), while environmental factors mainly comprise meteorological data such as temperature, rainfall, and sunshine duration. In this paper, phenotypes from other environments are added to the input; genes, environmental data, and phenotypes are three modalities, and the two-modal learning we tried improved the prediction results significantly (Fig. 5). Multi-modal learning is a promising research direction for dealing with genotype-by-environment interaction (G × E).

Evaluation metric in GS
The default evaluation metric in GS is the Pearson correlation coefficient. It always lies between -1 and 1, which makes it intuitive and convenient for comparing prediction results across phenotypes. In GS, the Pearson correlation coefficient generally falls between 0 and 1; the closer it is to 1, the better the prediction.
Conversely, when the Pearson correlation coefficient is close to 0, the predicted and observed phenotypes are linearly independent. Biologists are more concerned with the correctness of predicted trends than with minor improvements in numerical accuracy. Because the predicted trend is reflected in the Pearson correlation coefficient, it is a more suitable evaluation metric than the relative error in GS.
A small test set tends to overestimate the prediction accuracy of a GS algorithm, and we suspect this overestimation is a common phenomenon in the literature.
More test samples can objectively evaluate the prediction accuracy of the GS algorithm.
All data used in this paper come from public datasets. All original code has been deposited on GitHub and is publicly available; a link to the code is given in the text of the paper.

Fig. 1
Fig. 1 Schematic of ResGS. A The relationship between model selection and the amount of data; given the typical amount of data, there is no single optimal algorithm in GS. B The calculation process of ResGS, a hybrid algorithm. C Schematic of the residual block. Residual blocks slow down the vanishing gradient, thus allowing deeper networks.

Fig. 2
Fig. 2 Comparison of three compression methods in GS. A Three methods of compressing information. In max pooling, only the most significant marker works; in average pooling, each marker contributes equally; in strided convolution, information is compressed by weight. Rice genome-wide association studies show that strided convolution is the most suitable for GS. B Genome-wide association scans for alkali spreading value, panicle number per plant, and protein content. The p-value of each marker is not 0, and adjacent p-values are not close to each other. The p-values in Fig. 2B are drawn from the calculation results of the literature (Zhao et al. 2011).

Fig. 3
Fig. 3 The architectural details of ResGS. A Strided convolution block, comprising a convolutional layer, a ReLU activation layer, and batch normalization. B Residual unit, consisting of a strided convolution block with stride 2, a strided convolution block with stride 1, and two residual blocks. C Overall architecture of ResGS, a 50-layer medium-sized neural network. The loss function of ResGS is the mean squared error (MSE). The Adam and Nadam optimizers outperformed the other optimizers during our training; ResGS uses Adam or Nadam, and the performance of the two is relatively close. The batch size is set to 64 and should be reduced if the graphics memory is less than 20 GB. Training is terminated when the prediction accuracy does not improve for 100 consecutive epochs. Because of the small number of training samples, the model may become stuck in a local optimum; to avoid this, we run ResGS from different initialization parameters and take the best result as the global optimal prediction. Other details of ResGS are in the open-source code, at the URL given above.
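A sketch of the training setup described in this caption; `build_resgs` is a hypothetical constructor for the network, and `val_loss` stands in for the monitored prediction accuracy:

```python
import tensorflow as tf

# model = build_resgs(...)  # the 50-layer network sketched above
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # placeholder

model.compile(optimizer="adam", loss="mse")  # Adam or Nadam; MSE loss
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=100,  # stop after 100 epochs without improvement
    restore_best_weights=True)

# Repeat training from several random initializations and keep the best run.
# history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
#                     batch_size=64, epochs=10000, callbacks=[early_stop])
```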

Fig. 4
Fig. 4 Prediction accuracy of six GS algorithms for different phenotypes. A Prediction accuracy of the six algorithms for Dpoll, EarDia, and EarHT in maize301. B Prediction accuracy for AC and SL in rice395. C Prediction accuracy for ASV, AC, PNPP, PC, SL, and SNPP in rice413.

Fig. 5
Fig. 5 Prediction accuracy of six GS algorithms in different environments. A Two schemes of GS in multiple environments. In scheme one, only genes are used as input, which is equivalent to unimodal learning; in scheme two, genes and phenotypes from other environments are used as input, which is two-modal learning. B Performance of the six GS algorithms in four environments under scheme one. C Performance of the six GS algorithms in four environments under scheme two. D Comparison of the prediction accuracy of the GS algorithms in scheme 1 and scheme 2.

Fig. 6
Fig. 6 Comparison of prediction results between ResGS and traditional machine learning models. A Dpoll predictions by the traditional machine learning algorithm and ResGS. B Distributions of observed, traditional machine learning-predicted, and ResGS-predicted Dpoll. C SL predictions by the traditional machine learning algorithm and ResGS. D Distributions of observed, traditional machine learning-predicted, and ResGS-predicted SL. E PC predictions by the traditional machine learning algorithm and ResGS. F Distributions of observed, traditional machine learning-predicted, and ResGS-predicted PC.

Fig. 7
Fig. 7 Effect of training set size on ResGS. A Two schemes for evaluating the effect of training set size. In scheme 1, the testing set is 10% of the sample size; in scheme 2, the test set comprises the remaining records in rice413. B The relationship between ResGS prediction accuracy and training set size in scheme 1; when the training set is smallest, the ResGS prediction accuracy is highest. C The relationship between ResGS prediction accuracy and training set size in scheme 2; as the training set grows, the prediction accuracy of ResGS increases.