GSCNN: A genomic selection convolutional neural network model based on SNP genotype and physical distance features and data augmentation strategy

doi:10.21203/rs.3.rs-3991262/v1

Research Article

https://doi.org/10.21203/rs.3.rs-3991262/v1

This work is licensed under a CC BY 4.0 License

Version 1

Background

Genomic selection (GS) proves to be an effective method for augmenting plant and animal breeding efficiency. Deep learning displays remarkable flexibility and vast capacity for representation, enabling it to capture complex associations, and is deemed one of the most auspicious models for GS.

Methods

The present study proposes a deep-learning technique named genomic selection convolutional neural network (GSCNN) that introduces innovation in three aspects. First, the GSCNN encodes adjacent single nucleotide polymorphisms (SNPs) using the genotypes and the physical distance (PD) between SNPs, allowing more accurate determination of the complex associative relationships among SNPs. Second, we generate new samples by perturbing SNP sequences in units of chromosomes to address data scarcity and improve the performance of the GS deep learning model. Third, the GSCNN uses advanced deep learning techniques, namely Bidirectional Encoder Representation from Transformers (BERT) embedding and attention pooling, to interpret biosequence information.

Results

Compared to widely used GS models, such as genomic best linear unbiased prediction, reproducing kernel Hilbert space, Bayes B, Bayesian lasso, and deep learning genome-wide association study, the GSCNN demonstrated superior performance in six prediction tasks.

Conclusion

The GSCNN is a promising model for GS and provides a reference for applying deep learning to other life science fields.

genomic selection

prediction method

deep learning

data augmentation

Genomic selection (GS), initially proposed by Meuwissen et al., is a method that employs genome-wide genotype markers to predict the breeding values of an unobserved population, thereby expediting the identification of superior genotypes and accelerating the breeding cycle [1, 2]. Traditional statistical models, including ridge regression best linear unbiased prediction (rrBLUP), genomic best linear unbiased prediction (GBLUP), Bayes A, Bayes B, and Bayesian lasso (BL), are widely used in GS [3]. These methods make distributional assumptions about the effects of single nucleotide polymorphisms (SNPs) [1, 3]. Nevertheless, the precise influence of individual SNPs remains uncertain and may not strictly conform to any specific distribution. Moreover, traditional statistical models fail to capture complicated associations between SNPs, which is particularly relevant for complex diseases or traits shaped by epistasis [4, 5].

Deep-learning models offer nonparametric flexibility and powerful representation capabilities that enable them to address these challenges. Khaki et al. introduced a convolutional neural network (CNN)-recurrent neural network (RNN) model for crop yield prediction, which outperformed all other tested methods with a root-mean-square error (RMSE) of only 9% and 8% of the respective average yields for corn and soybean [6]. Ma et al. compared the predictive accuracy of six models on eight wheat traits and found that CNN, rrBLUP, and GBLUP were the top three performing models, with Pearson correlation coefficients of 0.742, 0.737, and 0.731, respectively [7]. A deep learning genome-wide association study (DLGWAS) model outperformed traditional statistical methods and some deep-learning models on simulation and soybean datasets [8]. Even though current deep learning models perform well in GS, several obstacles persist.

First, the physical characteristics of SNPs on the chromosome are a key factor in understanding the relationships among SNPs and their association with traits, yet they are disregarded by current GS models [9–11]. In different buffalo populations, the correlation (R²) between adjacent SNPs declined rapidly as the physical distance (PD) increased from 100 kb to 1 Mb [11]. In individuals of British descent from diverse backgrounds, PD explains about 45% of the variance in linkage disequilibrium (LD), and the remaining variance could be linked to physical traits of DNA that cause regional variations in mutation or recombination rates [9]. PD could be crucial in the effective identification of disease-causing genes through LD mapping [9]. Therefore, disregarding the physical features of SNPs would result in information loss, thereby reducing prediction accuracy.

Second, most deep learning models assume that there is sufficient data for optimizing the very large number of weight parameters in neural networks. When data is scarce, data augmentation emerges as an effective solution. In the realms of Natural Language Processing (NLP) and Computer Vision (CV), extensive research has been conducted on augmenting text and image data to enhance model performance and robustness [12]. However, the exploration of augmentation methods for biological sequence data remains relatively limited. Li et al. proposed a phage–host interaction prediction method (PHIAF) that employs a generative adversarial network (GAN) for data augmentation, generating high-quality synthetic samples [13]. A GAN-based method (FFPred-GAN) was proposed to generate high-quality synthetic protein feature samples, which significantly enhanced the accuracy of predicting all three domains of Gene Ontology terms [14]. Randomly substituting or inserting arbitrary amino acids in peptides collected from the UniProt database significantly improved the accuracy of peptide identification [15]. Cao et al. augmented the data by treating the reverse complement DNA sequence as another sample to improve DNA–protein binding prediction performance [16]. To date, augmentation strategies for SNP data have not been reported in the literature.

Emerging advanced deep-learning techniques constantly enhance model performance. For example, attention pooling, a new pooling technique, dynamically assigns weights to different positions of the input sequence [17]. Attention pooling extracts more comprehensive information from input sequences than max pooling and average pooling, which are commonly used in existing GS deep-learning models. BERT embedding, a new word-embedding technique in natural language processing, can capture diverse semantic knowledge associated with a given word across various contextual domains. It has also recently been demonstrated to possess significant potential for interpreting biosequence information [18]. Therefore, adopting such advanced deep learning techniques can help improve model performance.

In this study, we propose the genomic selection CNN (GSCNN) deep learning model. The GSCNN encodes each SNP by combining its genotype with the PD to the next SNP and applies the advanced techniques described above. An augmentation strategy specific to SNP data was adopted in the GSCNN to avoid overfitting. The GSCNN showed superior performance compared to five baseline models across two datasets, indicating its high generalizability, efficiency, and accuracy for GS.

Datasets

This study used two datasets with varying numbers of SNPs as well as different levels of trait heritability and population diversity to assess the accuracy of the model’s predictions and its ability to generalize. The soybean dataset from the “SoyNAM” R package contains 5,487 recombinant lines from 40 populations and 4,401 SNPs [19]. Four agronomic traits were evaluated: maturity, height, lodging, and protein. The missing genotypes were imputed using a forward Markov chain algorithm, with the missing loci being filled in with the most likely genotype according to the previous marker. The sorghum dataset contained 724 lines from four populations, with 9,139 high-quality SNPs [20]. Two traits were evaluated: plant height in Urbana, Illinois, USA (HT-IL) and flowering time in Urbana, Illinois, USA (FL-IL). The heritability of these six quantitative traits was obtained using GCTA software [21]. The heritabilities of soybean maturity, height, lodging, and protein and of sorghum HT-IL and FL-IL were 0.91, 0.91, 0.78, 0.82, 0.929, and 0.931, respectively.

Encoding

Each SNP is represented by two letters. The first letter represents the genotype of the SNP: homozygous major alleles are denoted by H, heterozygous alleles by L, and homozygous minor alleles by M. The second letter denotes the PD or the LD coefficient R² between the SNP and the next SNP. The PD is classified according to one of two schemes: 1) 0–10 kb, 10–20 kb, 20–50 kb, and > 50 kb; or 2) 0–100 kb, 100–200 kb, 200–500 kb, and > 500 kb. In both schemes the four classes are denoted by Y, J, W, and K, respectively. LD is likewise classified according to one of two schemes: 1) 0–0.2, 0.2–0.5, 0.5–0.7, and 0.7–1; or 2) 0–0.7, 0.7–0.8, 0.8–0.9, and 0.9–1, again denoted by Y, J, W, and K, respectively. For the last SNP on each chromosome, which has no following SNP, the second letter is N. In the control group, all but the last SNP on each chromosome have a second letter of J. LD was calculated with PLINK [22].
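The two-letter encoding with PD classes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the numeric genotype coding (0, 1, 2), the function names, and the rule that a distance lying exactly on a class boundary falls into the lower class are assumptions made for illustration.

```python
from bisect import bisect_left

# Assumed numeric coding: 0 = homozygous major, 1 = heterozygous, 2 = homozygous minor.
GENOTYPE_LETTER = {0: "H", 1: "L", 2: "M"}

# Upper edges (in bp) of the first three PD classes; anything larger is class "K".
NARROW_EDGES = (10_000, 20_000, 50_000)     # 0-10 kb, 10-20 kb, 20-50 kb, > 50 kb
WIDE_EDGES = (100_000, 200_000, 500_000)    # 0-100 kb, 100-200 kb, 200-500 kb, > 500 kb
PD_LETTERS = "YJWK"


def pd_letter(distance_bp, edges=NARROW_EDGES):
    """Map a distance to Y/J/W/K; a distance exactly on an edge goes to the lower class (assumption)."""
    return PD_LETTERS[bisect_left(edges, distance_bp)]


def encode_chromosome(genotypes, positions, edges=NARROW_EDGES):
    """Encode one chromosome as two-letter tokens: genotype letter + PD class to the next SNP."""
    tokens = []
    last = len(genotypes) - 1
    for i, genotype in enumerate(genotypes):
        if i == last:
            second = "N"  # the last SNP on a chromosome has no following SNP
        else:
            second = pd_letter(positions[i + 1] - positions[i], edges)
        tokens.append(GENOTYPE_LETTER[genotype] + second)
    return tokens


# Four SNPs separated by 5 kb, 15 kb, and 100 kb.
narrow = encode_chromosome([0, 1, 2, 0], [1_000, 6_000, 21_000, 121_000])
wide = encode_chromosome([0, 1, 2, 0], [1_000, 6_000, 21_000, 121_000], WIDE_EDGES)
```

With the narrow scheme the three gaps fall into different classes, whereas the wide scheme places all of them in class Y, which illustrates how the narrow scheme separates closely spaced SNPs more finely.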

Data augmentation

Existing models arrange SNPs in chromosome numbering order as the model’s input [6–8, 23]. Because the numbering order lacks biological significance, we propose a data augmentation strategy in which the SNP sequence is rearranged in units of whole chromosomes to form a new sample (Fig. 1a). Two augmentation methods were considered: online and offline augmentation. Online augmentation refers to augmentation that is carried out while training, with a new training set being generated at the beginning of each training step (Fig. 1b). The training set generated by online augmentation has the same size as the unaugmented training set, but it is not repeated across epochs. Offline augmentation refers to augmenting the data before training; the resulting training set consists of the augmentation set and the original training set and is repeated for each epoch (Fig. 1c). We investigated the differential impact of online and offline augmentation on the GSCNN, with the phenotype data being augmented in a corresponding manner. The augmentation was performed only on the original training set.
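The chromosome-level rearrangement can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: samples are assumed to be stored as lists of per-chromosome token lists, and the function names and the `copies_per_sample` parameter are hypothetical.

```python
import random


def shuffle_chromosomes(sample, rng):
    """Return a new sample with the chromosome order permuted.

    `sample` is a list of chromosomes, each a list of SNP tokens; the SNP order
    inside every chromosome is left unchanged.
    """
    order = list(range(len(sample)))
    rng.shuffle(order)
    return [sample[c] for c in order]


def offline_augment(samples, phenotypes, copies_per_sample, seed=0):
    """Offline variant: original training set plus shuffled copies, built once before training."""
    rng = random.Random(seed)
    aug_x, aug_y = list(samples), list(phenotypes)
    for x, y in zip(samples, phenotypes):
        for _ in range(copies_per_sample):
            aug_x.append(shuffle_chromosomes(x, rng))
            aug_y.append(y)  # the phenotype is carried over unchanged with its genotype
    return aug_x, aug_y


def online_training_set(samples, phenotypes, rng):
    """Online variant: a freshly shuffled set of the same size as the original."""
    return [shuffle_chromosomes(x, rng) for x in samples], list(phenotypes)


# Two toy samples, three chromosomes each.
train_x = [
    [["HY", "LN"], ["MJ", "HN"], ["LK", "MN"]],
    [["LY", "HN"], ["HJ", "LN"], ["MK", "HN"]],
]
train_y = [1.5, 2.0]

aug_x, aug_y = offline_augment(train_x, train_y, copies_per_sample=2)
online_x, online_y = online_training_set(train_x, train_y, random.Random(1))
```

In this sketch the offline set grows with the number of copies and is fixed for the whole run, while the online set keeps the original size and is regenerated each time the function is called.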

The GSCNN

The GSCNN’s architecture comprises three key components: 1) embedding layer; 2) CNN layer; and 3) linear layer (Fig. 2a).

The embedding layer is divided into two parts: 1) the SNP sequence embedding, which uses BERT embedding; and 2) the population structure embedding, which applies, in sequence, Embedding, Batch Normalization (Batch Norm), Attention Pooling (AttentionPool), Dropout, Batch Norm, and Gaussian Error Linear Unit (GELU) operations. The CNN layer is composed of two CNN streams. One stream takes the SNP sequence embedding as input and uses a small kernel size to capture detailed variations in the SNP sequences. The other stream takes the combination of the SNP sequence embedding and the population structure embedding as input and uses a large kernel size to capture high-level conceptual representations [24]. The details of the convolution tower are shown in Fig. 2b. Attention pooling computes softmax weights for each channel (Fig. 2c), as employed in Enformer [25]. The linear layer contains a linear module and a dropout module.
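The idea of attention pooling with per-channel softmax weights can be sketched in plain Python. This is a simplified illustration, not the GSCNN or Enformer implementation: the logits here come from one hypothetical learned scalar per channel (`w`), whereas a full implementation would typically use a learned projection, and non-overlapping windows are assumed.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(x, pool_size, w):
    """Pool a sequence along its length with per-channel softmax weights.

    x: list of positions, each a list of channel values (length L, C channels).
    w: one assumed learned scalar per channel; the logit at a position is w[c] * x[t][c].
    Returns L // pool_size pooled positions; trailing positions that do not fill
    a whole window are dropped.
    """
    n_channels = len(x[0])
    pooled = []
    for start in range(0, len(x) - pool_size + 1, pool_size):
        window = x[start:start + pool_size]
        out = []
        for c in range(n_channels):
            column = [position[c] for position in window]
            weights = softmax([w[c] * v for v in column])
            out.append(sum(a * v for a, v in zip(weights, column)))
        pooled.append(out)
    return pooled


features = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0], [7.0, 6.0]]
as_average = attention_pool(features, pool_size=2, w=[0.0, 0.0])
near_max = attention_pool(features, pool_size=2, w=[50.0, 50.0])
```

In this simplified form, zero weights reduce attention pooling to average pooling and very large weights approach max pooling, so the learned weights let the model interpolate between the two behaviours for each channel.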

Training and evaluation

We randomly split the original data into training and testing sets at a ratio of 80%:20% ten times (random seeds: 0–9). The GSCNN was implemented in PyTorch 11.6, and we used the AdamW optimizer (weight_decay = 0.0002) and the CosineAnnealingLR scheduler (max_epoch = train_epoch). The GSCNN was trained with a training and testing batch size of 32 and a learning rate of 0.0012, for 30 epochs on the soybean data and 200 epochs on the sorghum data. The average coefficient of determination (R²) of the ten tests was defined as the prediction accuracy.
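The evaluation protocol (ten seeded 80%:20% splits, accuracy as the mean R²) can be sketched as below. Several details are assumptions: R² is computed as 1 − SS_res/SS_tot, which may differ from the definition the authors used, the shuffling procedure is generic, and `predict_fn` is a hypothetical callback that stands in for training the model and predicting the test phenotypes.

```python
import random


def split_indices(n, seed, train_fraction=0.8):
    """Seeded random split of range(n) into training and testing indices."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(n * train_fraction))
    return indices[:cut], indices[cut:]


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_r2_over_seeds(y, predict_fn, seeds=range(10)):
    """Average test-set R2 over one split per seed.

    predict_fn(train_idx, test_idx) is a hypothetical callback that fits a model
    on the training indices and returns predictions for the test indices.
    """
    scores = []
    for seed in seeds:
        train_idx, test_idx = split_indices(len(y), seed)
        predictions = predict_fn(train_idx, test_idx)
        scores.append(r_squared([y[i] for i in test_idx], predictions))
    return sum(scores) / len(scores)


# Toy check with an oracle "model" that returns the true phenotypes.
phenotypes = [float(i) for i in range(20)]
oracle_score = mean_r2_over_seeds(phenotypes, lambda train, test: [phenotypes[i] for i in test])
```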

Other models for baseline comparison

To demonstrate the effectiveness of the GSCNN, we compared it to five state-of-the-art models, including four linear models and one nonlinear model. GBLUP, a linear model, predicts breeding values for all genotyped individuals according to the genomic relationship matrix [26]. GBLUP was implemented through the “rrBLUP” package [27]. The reproducing kernel Hilbert space (RKHS) model, a linear model and a typical semi-parametric method that employs the Gaussian kernel function to fit the model, was implemented through the “BGLR” R package [28]. Bayes B, a linear model, was implemented through the “BGLR” R package, based on the Markov chain Monte Carlo (MCMC) strategy with 1,500 iterations and 500 burn-in iterations [27]. In addition, BL, a linear model, was implemented through the “glment” R package based on the MCMC strategy with 1,500 iterations and 500 burn-in iterations [28]. These models provide precise predictions of breeding values, which are determined by additive effects. In contrast, DLGWAS, a nonlinear model, is a dual-stream model [8]. The first CNN stream contains two feed-forward CNN layers with kernel sizes of 4 and 20, and the second stream contains a single CNN layer with a kernel size of 4. The feature-processing block includes one CNN layer that integrates all the outputs and processes the genotype features. The output-processing block consists of a flattening layer and a single neuron for generating the final predicted phenotypes.

Interpretability

Interpreting deep-learning models implies translating the intricate mathematical principles acquired by neural networks into biological principles, thereby offering novel insights into biology. To trace the decisions of the GSCNN back to SNPs, we computed attribution scores by multiplying the gradient by the input ($\text{Input} \times \text{Gradient}$) [29]. Because the input to the BERT embedding layer is not differentiable, we derived the prediction contribution score by using the output of the BERT embedding as the input for differentiation.

$$\text{Input} \times \text{Gradient} = x_i \frac{\partial S_c(x)}{\partial x_i}$$

Here, $x_i$ is the input to unit $i$, and $\frac{\partial S_c(x)}{\partial x_i}$ is the partial derivative of the output of unit $c$ with respect to the input to unit $i$. The contribution score of each SNP was the normalized average of its contribution scores across the ten tests, and a Manhattan plot was generated from these contribution scores.
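The Input × Gradient attribution can be illustrated on a toy differentiable function. The sketch below rests on several assumptions: gradients are approximated by central finite differences rather than automatic differentiation, `toy_model` is a stand-in for the network downstream of the embedding, per-SNP scores are obtained by summing absolute attributions over each SNP's embedding dimensions, and normalization divides by the total. None of these choices is confirmed as the authors' procedure.

```python
def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def input_x_gradient(f, x):
    """Attribution of each input unit: x_i * dS(x)/dx_i."""
    return [xi * gi for xi, gi in zip(x, numerical_gradient(f, x))]


def snp_scores(attributions, embed_dim):
    """Collapse attributions to one score per SNP (sum of absolute values, an assumption)."""
    return [
        sum(abs(a) for a in attributions[start:start + embed_dim])
        for start in range(0, len(attributions), embed_dim)
    ]


def normalize(scores):
    """Scale scores so that they sum to 1 (assumed normalization)."""
    total = sum(scores)
    return [s / total for s in scores]


def toy_model(x):
    # Hypothetical scalar output S(x); stands in for the network after the embedding.
    return 3.0 * x[0] + x[1] * x[2]


# Two SNPs, each with a 2-dimensional embedding, flattened into one vector.
embedded = [1.0, 2.0, 0.5, 4.0]
attributions = input_x_gradient(toy_model, embedded)
scores = normalize(snp_scores(attributions, embed_dim=2))
```

In this toy case the last embedding dimension does not affect the output, so its attribution is zero, and the first SNP receives the larger normalized score.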

Influence of the PD between the SNPs on the GSCNN’s performance

In the GSCNN, each SNP is represented by two letters that represent the genotype and the PD or LD with the next SNP. The PD or LD is represented by different letters according to the classification scheme used. As presented in Table 1, adding either PD or LD to the encoding improves the average prediction accuracy of the GSCNN on the six traits compared to encoding SNPs with genotype only. When SNPs were encoded with genotype and narrow-classified PD (classification: 0–10 kb, 10–20 kb, 20–50 kb, and > 50 kb), the GSCNN achieved the highest average prediction accuracy for the six traits (0.6496), exhibiting improvements of 2.28%, 1.88%, 1.26%, and 1.18% compared to encoding SNPs with genotype only, genotype and LD (classification: 0–0.2, 0.2–0.5, 0.5–0.7, and > 0.7), genotype and LD (classification: 0–0.7, 0.7–0.8, 0.8–0.9, and > 0.9), and genotype and wide-classified PD (classification: 0–100 kb, 100–200 kb, 200–500 kb, and > 500 kb), respectively.

Table 1

The GSCNN’s prediction accuracy according to different SNP encodings.
SNP encoding	Soybean maturity	Soybean height	Soybean lodging	Soybean protein	Sorghum HT-IL	Sorghum FL-IL	Mean
I	0.5927	0.5620	0.4467	0.5491	0.8406	0.8194	0.6351
II	0.5976	0.5699	0.4507	0.5529	0.8346	0.8196	0.6376
III	0.6011	0.5839	0.4483	0.5644	0.8300	0.8212	0.6415
IV	0.5950	0.5773	0.4500	0.5544	0.8462	0.8289	0.6420
V	0.6059	0.5880	0.4575	0.5662	0.8525	0.8274	0.6496

I, genotype only; II, genotype with LD (LD classification: 0–0.2, 0.2–0.5, 0.5–0.7, and > 0.7); III, genotype with LD (LD classification: 0–0.7, 0.7–0.8, 0.8–0.9, and > 0.9); IV, genotype with PD (PD classification: 0–100 kb, 100–200 kb, 200–500 kb, and > 500 kb); V, genotype with PD (PD classification: 0–10 kb, 10–20 kb, 20–50 kb, and > 50 kb).
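The relative improvements quoted for the narrow-classified PD encoding can be reproduced from the Mean column of Table 1. The short Python check below uses only the values in the table; the variable and function names are illustrative.

```python
# Per-trait accuracies and reported means from Table 1 (encodings I to V).
TABLE_1 = {
    "I": ([0.5927, 0.5620, 0.4467, 0.5491, 0.8406, 0.8194], 0.6351),
    "II": ([0.5976, 0.5699, 0.4507, 0.5529, 0.8346, 0.8196], 0.6376),
    "III": ([0.6011, 0.5839, 0.4483, 0.5644, 0.8300, 0.8212], 0.6415),
    "IV": ([0.5950, 0.5773, 0.4500, 0.5544, 0.8462, 0.8289], 0.6420),
    "V": ([0.6059, 0.5880, 0.4575, 0.5662, 0.8525, 0.8274], 0.6496),
}


def relative_gain_pct(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# Largest gap between a recomputed row mean and the reported Mean column.
max_mean_gap = max(abs(sum(values) / len(values) - mean) for values, mean in TABLE_1.values())

# Improvement of encoding V over each other encoding, from the reported means.
best_mean = TABLE_1["V"][1]
gains = {
    name: round(relative_gain_pct(best_mean, TABLE_1[name][1]), 2)
    for name in ("I", "II", "III", "IV")
}
# gains -> {"I": 2.28, "II": 1.88, "III": 1.26, "IV": 1.18}
```

The recomputed row means agree with the reported Mean column to within rounding, and the gains match the percentages stated in the text.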

Effects of augmentation on the GSCNN’s performance

Based on the most effective classification method for PD according to our results (0–10 kb, 10–20 kb, 20–50 kb, > 50 kb), we evaluated the impact of different augmentation strategies on the GSCNN. We proposed a novel approach to augmenting biological sequence data in which the order of chromosomes is shuffled, and we evaluated its effectiveness on the two datasets (Table S1). The percentage change in the GSCNN’s prediction accuracy under different augmentation strategies, relative to no augmentation, is shown in Fig. 3. For online augmentation, the prediction accuracy of all traits, except for soybean lodging and protein, decreased compared to no augmentation. The decline was particularly notable for the sorghum data. Conversely, the offline augmentation strategy improved the prediction accuracy of all traits. For soybean maturity, lodging, and protein, the GSCNN’s prediction accuracy improved the most when the training set was offline augmented five-fold (1.55%, 5.55%, and 2.86%, respectively). For soybean height and sorghum HT-IL and FL-IL, the GSCNN's prediction accuracy gradually improved as the offline augmentation fold increased, with the highest improvement observed for eight-fold offline augmentation (6.4%, 5.54%, and 2.49%, respectively). The average prediction accuracies of the six traits without augmentation, with online augmentation, and with offline two-fold, offline five-fold, and offline eight-fold augmentations were 0.6496, 0.6025, 0.6616, 0.6723, and 0.6740, respectively. Therefore, offline augmentation was more powerful than online augmentation.

The GSCNN’s prediction accuracy and its comparison to baseline methods

We evaluated the prediction accuracy (R²) of GBLUP, RKHS, Bayes B, BL, DLGWAS, and the GSCNN (offline eight-fold) on two datasets: soybean and sorghum (Fig. 4, Table S2). The GSCNN achieved the highest prediction accuracy for soybean maturity, height, and lodging and for sorghum HT-IL, and performed slightly worse than Bayes B and BL for soybean protein and sorghum FL-IL. The average prediction accuracy of the GSCNN on the six traits (0.6740) improved by 16.73%, 16%, 3.10%, 3.47%, and 56.08% compared to GBLUP (0.5778), RKHS (0.5779), Bayes B (0.6542), BL (0.6518), and DLGWAS (0.4321), respectively.

Ablation study

These results demonstrate the effectiveness of the GSCNN, which is inseparable from its design. The BERT embedding in the GSCNN can dynamically adjust the word vector based on the different contexts in which the same word appears, enabling it to learn higher-quality word embeddings. The attention pooling in the GSCNN enables the derivation of effective representations for complex trait prediction, which necessitates the analysis of higher-order statistics that may not be easily achieved through average pooling or max pooling [30]. We evaluated the performance of the following GSCNN variants: GSCNN-B, a variant that uses an ordinary embedding instead of BERT embedding; GSCNN-A, a variant that uses average pooling instead of attention pooling; and GSCNN-M, a variant that uses max pooling instead of attention pooling. For computational efficiency, no data augmentation strategy was used in any of the models. As shown in Table 2, the prediction accuracy of the GSCNN for soybean maturity, height, lodging, and protein and for sorghum HT-IL and FL-IL was 0.6059, 0.5880, 0.4575, 0.5662, 0.8525, and 0.8274, respectively. The GSCNN achieved the highest accuracy for all six traits, indicating the important role of BERT embedding and attention pooling.

Table 2

The prediction accuracy of the GSCNN and its variants.
Model	Soybean maturity	Soybean height	Soybean lodging	Soybean protein	Sorghum HT-IL	Sorghum FL-IL	Mean
GSCNN-B	0.6037	0.5709	0.4550	0.5628	0.8490	0.8265	0.6446
GSCNN-A	0.5738	0.5576	0.4441	0.5560	0.8449	0.8097	0.6310
GSCNN-M	0.5917	0.5602	0.4449	0.5626	0.8393	0.8167	0.6359
GSCNN	0.6059	0.5880	0.4575	0.5662	0.8525	0.8274	0.6496

Interpretability of the GSCNN

Explaining a deep learning model’s predictions can be just as critical as prediction accuracy in GS. One of the most popular interpretation approaches is $\text{Input} \times \text{Gradient}$, which we used to calculate each SNP’s contribution to the prediction (Supplemental file1, Supplemental file2) [29]. As depicted in Fig. 5, the majority of the SNPs exhibited similar predictive power, with only a small subset demonstrating greater influence. Therefore, we identified these high-impact SNPs as potential loci for subsequent association analyses.

SNP Gm04_29528926 is near E8, a locus involved in flowering time and maturity in soybean [31]. Glyma.02g058500, near SNP Gm02_5360523, is a homolog of INDETERMINATE DOMAIN1/ENHYDROUS, which promotes seed maturation by regulating light GA effects and ABA signaling in Arabidopsis [32]. Substantial evidence supports a strong correlation between plant height and lodging in soybean [33, 34]. Our results support this conclusion: the Manhattan plots of the SNPs’ contribution scores for plant height and lodging show a similar curve pattern, with high-contributing SNPs concentrated on chromosomes 13 and 19 and overlapping substantially. The copy numbers of Glyma.13G287600 and Glyma.13G288000, near Gm13_38408846, were negatively correlated with trailing growth and shoot length, which determine plant height and, in turn, affect lodging [35]. Gm08_45695835 is near the significant region for soybean protein on Gm08 from 45.5 to 46.9 Mb [36]. In sorghum, S7_58504205, S7_55229509, and S9_57192617 are near the significant loci associated with plant height: Dw3, qHT7.1, and Dw1, respectively [37].

Adding the PD feature to SNP encoding improves the GSCNN’s prediction accuracy more than adding the LD feature

Predicting quantitative traits from SNP genotypes forms the cornerstone of many GS models. Beyond genotype considerations, the literature has extensively explored the impact of various factors on prediction accuracy. The study conducted by Scutari revealed a linear decay in the correlation between true and predicted values with increasing genetic distance under two common measures of that distance [38]. Both LD and co-segregation (CS) have been established as crucial sources of information contributing significantly to the accuracy of genomic prediction [39–41]. Historical LD information persists across generations, playing a role in enhancing prediction accuracy across diverse families and validation generations [39]. Modeling CS explicitly results in improved genomic prediction accuracy, particularly in cases where there is low historical LD between quantitative trait loci (QTL) and SNPs [41]. Ren et al. constructed different SNP panels with given numbers of SNPs according to PD and found that the smaller the variation in PD between adjacent SNPs, the higher the prediction accuracy [42]. However, they used PD information as a basis for constructing SNP panels with varying densities rather than modeling it directly [42].

The incorporation of PD information into SNP encoding in the GSCNN is expected to enhance prediction accuracy by enabling more accurate determination of relationships between SNPs, as these relationships rely heavily on PD. The impact of three SNP encoding strategies on prediction accuracy was evaluated: genotype only, genotype combined with LD, and genotype combined with PD. Our findings suggest that adding either LD or PD information to the SNP encoding improved prediction accuracy, but adding PD information resulted in a greater improvement. This may stem from the fact that LD calculation relies heavily on genotype frequencies, so LD values are largely redundant with the genotypes themselves [22]. In contrast, the GSCNN's complex network structure allows it to capture more precise SNP relationships directly from the PD features. The impact of classifying LD and PD in different ways was further investigated. For both LD and PD, classification schemes that help the model identify closely linked SNPs resulted in higher prediction accuracy. In conclusion, PD information is more advantageous than LD for enhancing the model's performance, especially when PD is categorized within a narrow range.

A novel augmentation strategy for SNP data

Larger datasets lead to improved performance of deep-learning models, but the limited availability of data remains a major obstacle to the application of deep learning models in GS [43]. Data augmentation is a valuable strategy for addressing data scarcity, as it encourages deep-learning models to learn generalized representations rather than spurious correlations [12]. At present, there are two main types of data augmentation methods for biological sequences: GAN-based augmentation and sequence transformation-based augmentation [13–16]. Obtaining high-quality augmented data with GAN-based methods requires hyperparameter tuning and long training times, which are time-consuming and computationally intensive. Additionally, the restricted length of sequences generated by a GAN poses a challenge for its application to SNP data. The use of reverse complementary sequences as additional samples is not suitable for SNP data augmentation because SNP genotypes already provide information from both DNA strands.

The ideal data augmentation strategy should be both easy to implement and highly effective. Herein, we proposed a data augmentation strategy that generates new samples by reordering the chromosomes, as their numbering order lacks biological significance. Perturbing the order of chromosomes increases the sample size, which mitigates overfitting, and it also allows any two chromosomes to become adjacent, thereby enhancing the model's capacity to capture SNP relationships across different chromosomes, such as long-range LD [44]. We compared two augmentation strategies: online and offline augmentation. Offline augmentation refers to applying augmentation before training, which requires significant storage capacity [12]. In contrast, online augmentation does not require additional storage space, as the operations are performed in real time [12]. Online augmentation enables the generation of theoretically larger datasets, consequently introducing a higher level of noise [12]. Online augmentation showed much better efficacy when applied to the soybean dataset than to the sorghum dataset. This might be explained by the fact that the sorghum data has a lower sample/feature ratio, rendering it more susceptible to noise. Offline augmentation was more powerful than online augmentation in our study, and the enhancement in model performance became more pronounced with an increasing number of augmentation folds. The augmentation strategy we propose is particularly well suited to SNP data, offering a high level of operability and effectiveness compared to previously proposed augmentation strategies [13, 16].

Advanced deep learning techniques improve GS prediction accuracy

Deciphering the hidden instructions in SNP sequences has been one of the major goals of GS models. The rapid advancement of deep learning has accelerated this decoding process. Because SNPs with the same genotype can differ in heritability and because long-range LD exists, SNP sequences share two key properties of natural language: polysemy and distant semantic relationships, respectively [44, 45]. To better model SNP sequences as a language, we adopted BERT embedding to distinguish polysemous SNPs using contextual information and to detect distant relationships. BERT embedding includes token embedding, segment embedding, and position embedding, which are used to generate word vectors, differentiate sentences, and learn the order properties of the input, respectively [18]. Because the letters representing PD or LD serve as the segment embedding inputs, the GSCNN treats adjacent SNPs with the same PD or LD class as having similar semantics, analogous to an LD block. By incorporating a weighted combination of these three embeddings, BERT embedding in the GSCNN captures the semantic similarity between SNPs and determines the target SNPs’ effects across diverse contexts, thereby surpassing the limitations imposed by traditional static embeddings [18].
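How the three embeddings combine can be sketched with small lookup tables. This is an illustration under assumptions, not the GSCNN implementation: the genotype letter is assumed to be the token input, the tables are randomly initialized rather than learned, the class and function names are hypothetical, and the three vectors are simply summed as in the original BERT input layer, whereas the text above describes a weighted combination whose weights are not specified.

```python
import random

GENOTYPES = "HLM"    # first letter of each SNP token (assumed token-embedding input)
SEGMENTS = "YJWKN"   # second letter: PD or LD class, used as the segment-embedding input


def make_table(n_rows, dim, rng):
    """Randomly initialized lookup table; in a real model these values are learned."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_rows)]


class SnpEmbedding:
    """Sum of token, segment, and position embeddings for two-letter SNP tokens."""

    def __init__(self, max_len, dim, seed=0):
        rng = random.Random(seed)
        self.token = make_table(len(GENOTYPES), dim, rng)
        self.segment = make_table(len(SEGMENTS), dim, rng)
        self.position = make_table(max_len, dim, rng)

    def __call__(self, tokens):
        vectors = []
        for pos, tok in enumerate(tokens):
            t = self.token[GENOTYPES.index(tok[0])]
            s = self.segment[SEGMENTS.index(tok[1])]
            p = self.position[pos]
            vectors.append([a + b + c for a, b, c in zip(t, s, p)])
        return vectors


embedding = SnpEmbedding(max_len=8, dim=4)
vectors = embedding(["HY", "HY", "LN"])
```

Even in this input-level sketch, two SNPs with the same genotype and PD class receive different vectors when they occupy different positions; the context-dependent part of a BERT-style representation would come from the layers applied after this sum.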

To achieve a fixed-length vector and reduce spatial size, a pooling function is employed on the feature map. Attention pooling was proposed to address limitations of the commonly used pooling strategies. According to the results of the ablation experiments (Table 2), replacing attention pooling with average pooling in the GSCNN leads to a larger decrease in prediction accuracy than when it is replaced with max pooling. Unlike max pooling and attention pooling, average pooling fails to preserve the most significant features of the data [46]. Attention pooling allows the model to gather information from all locations in the input sequence, not just local maxima, by learning global attention weights [30]. These findings demonstrate that both important features and global information play crucial roles in genomic prediction.

The GSCNN outperformed the baseline models, indicating its strong generalization ability and robustness. Incorporating the PD feature into SNP encoding enhanced the prediction accuracy of the GSCNN more than incorporating the LD feature. Furthermore, we have shown that data augmentation by reordering chromosomes to generate new samples is effective and easy to implement. This strategy can be readily coupled with other deep learning GS models and opens the door to bio-sequence data augmentation. However, some limitations must be addressed. First, the optimal way of fusing SNP genotype and PD has not been exhaustively explored and requires further investigation. Second, because of the BERT embedding, a long sequence consumes substantial computing resources, which limits the scenarios in which the GSCNN can be applied. In conclusion, the GSCNN can benefit crop breeding and other sequence-based prediction tasks in bioinformatics.

BERT Bidirectional Encoder Representation from Transformers

BL Bayesian lasso

CNN convolutional neural network

CS Co-segregation

GAN generative adversarial network

GBLUP genomic best linear unbiased prediction

GS genomic selection

GSCNN genomic selection convolutional neural network

LD linkage disequilibrium

MCMC Markov chain Monte Carlo

PD physical distance

rrBLUP ridge regression best linear unbiased prediction

RKHS reproducing kernel Hilbert space

RMSE root-mean-square error

RNN recurrent neural network

SNPs single nucleotide polymorphisms

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare no conflicts of interest.

Consent for publication

Not applicable.

Funding

This study was supported by Special Funds for Construction of Innovative Provinces in Hunan Province (2021NK1011) and the Science and Technology Innovation Program of Hunan Province (2023NK2001).

Data availability

The source code and datasets analyzed during the current study are available at https://github.com/luxixi2021/GSCNN.

Author contributions

L.J. methodology, formal analysis, writing - original draft, writing - review and editing. W.H. writing - review and editing. L.X. data curation and visualization. H.Z. formal analysis. C.L. formal analysis. Z.Y. conceptualization and funding acquisition. L.L. conceptualization, writing - review and editing, supervision and funding acquisition. All authors have read and agreed to the published version of the manuscript.

Meuwissen TH, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001; 157:1819-29.
Li L, Zheng X, Wang J, Zhang X, He X, Xiong L et al. Joint analysis of phenotype-effect-generation identifies loci associated with grain quality traits in rice hybrids. Nat Commun. 2023; 14:3930.
Crossa J, Pérez-Rodríguez P, Cuevas J, Montesinos-López O, Jarquín D, de Los Campos G et al. Genomic Selection in Plant Breeding: Methods, Models, and Perspectives. Trends Plant Sci. 2017; 22:961-75.
Johnson MS, Reddy G, Desai MM. Epistasis and evolution: recent advances and an outlook for prediction. BMC Biol. 2023; 21:120.
Webber C. Epistasis in Neuropsychiatric Disorders. Trends Genet. 2017; 33:256-65.
Khaki S, Wang L, Archontoulis SV. A CNN-RNN Framework for Crop Yield Prediction. Front Plant Sci. 2019; 10:1750.
Ma W, Qiu Z, Song J, Li J, Cheng Q, Zhai J et al. A deep convolutional neural network approach for predicting phenotypes from genotypes. Planta. 2018; 248:1307-18.
Liu Y, Wang D, He F, Wang J, Joshi T, Xu D. Phenotype Prediction and Genome-Wide Association Study Using Deep Convolutional Neural Network of Soybean. Front Genet. 2019; 10:1091.
Abecasis GR, Noguchi E, Heinzmann A, Traherne JA, Bhattacharyya S, Leaves NI et al. Extent and distribution of linkage disequilibrium in three genomic regions. American journal of human genetics. 2001; 68:191-97.
Reich DE, Cargill M, Bolk S, Ireland J, Sabeti PC, Richter DJ et al. Linkage disequilibrium in the human genome. Nature. 2001; 411:199-204.
Rahimmadar S, Ghaffari M, Mokhber M, Williams JL. Linkage Disequilibrium and Effective Population Size of Buffalo Populations of Iran, Turkey, Pakistan, and Egypt Using a Medium Density SNP Array. Front Genet. 2021; 12:608186.
Shorten C, Khoshgoftaar TM, Furht B. Text Data Augmentation for Deep Learning. Journal of Big Data. 2021; 8:101.
Li M, Zhang W. PHIAF: prediction of phage-host interactions with GAN-based data augmentation and sequence-based feature fusion. Brief Bioinformatics. 2022; 23.
Wan C, Jones DT. Protein function prediction is improved by creating synthetic feature samples with generative adversarial networks. Nature Machine Intelligence. 2020; 2:540-50.
Lee B, Shin MK, Hwang IW, Jung J, Shim YJ, Kim GW et al. A Deep Learning Approach with Data Augmentation to Predict Novel Spider Neurotoxic Peptides. International journal of molecular sciences. 2021; 22.
Cao Z, Zhang S. Simple tricks of convolutional neural network architectures improve DNA-protein binding prediction. Bioinformatics. 2019; 35:1837-43.
Touvron H, Cord M, El-Nouby A, Bojanowski P, Joulin A, Synnaeve G et al. Augmenting Convolutional networks with attention-based aggregation. ArXiv. 2021; abs/2112.13692.
Le NQK, Ho QT, Nguyen TT, Ou YY. A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief Bioinformatics. 2021; 22.
Xavier A, Muir WM, Rainey KM. Assessing Predictive Properties of Genome-Wide Selection in Soybeans. G3 (Bethesda, Md). 2016; 6:2611-6.
Higgins RH, Thurber CS, Assaranurak I, Brown PJ. Multiparental mapping of plant height and flowering time QTL in partially isogenic sorghum families. G3 (Bethesda, Md). 2014; 4:1593-602.
Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. American journal of human genetics. 2011; 88:76-82.
Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015; 4:7.
Abdollahi-Arpanahi R, Gianola D, Peñagaricano F. Deep learning versus parametric and ensemble methods for genomic prediction of complex phenotypes. Genet Sel Evol. 2020; 52:12.
Mishkin D, Sergievskiy N, Matas J. Systematic Evaluation of Convolution Neural Network Advances on the ImageNet. Comput Vis Image Und. 2017; 161.
Avsec Ž, Agarwal V, Visentin D, Ledsam JR, Grabska-Barwinska A, Taylor KR et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods. 2021; 18:1196-203.
VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008; 91:4414-23.
Endelman JB. Ridge Regression and Other Kernels for Genomic Selection with R Package rrBLUP. Plant Genome. 2011; 4.
Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw. 2010; 33:1-22.
Shrikumar A, Greenside P, Kundaje A: Learning Important Features Through Propagating Activation Differences. In: Proceedings of Machine Learning Research: Edited by Doina P, Yee Whye T. PMLR 2017: 3145-53.
Li P, Song Y, Mcloughlin I, Guo W, Dai L: An Attention Pooling Based Representation Learning Method for Speech Emotion Recognition. In: International Speech Communication Association: 2018.
Cheng L, Wang Y, Zhang C, Wu C, Xu J, Zhu H et al. Genetic analysis and QTL detection of reproductive period and post-flowering photoperiod responses in soybean. Theor Appl Genet. 2011; 123:421-9.
Feurtado JA, Huang D, Wicki-Stordeur L, Hemstock LE, Potentier MS, Tsang EWT et al. The Arabidopsis C2H2 Zinc Finger INDETERMINATE DOMAIN1/ENHYDROUS Promotes the Transition to Germination by Regulating Light and Hormonal Signaling during Seed Maturation. Plant Cell. 2011; 23:1772-94.
Wang X, Li MW, Wong FL, Luk CY, Chung CY, Yung WS et al. Increased copy number of gibberellin 2-oxidase 8 genes reduced trailing growth and shoot length during soybean domestication. Plant J. 2021; 107:1739-55.
Keep NR, Schapaugh W, Prasad PVV, Boyer JE. Changes in Physiological Traits in Soybean with Breeding Advancements. Crop Sci. 2016; 56:122-31.
Heucken N, Ivanov R. The retromer, sorting nexins and the plant endomembrane protein trafficking. J Cell Sci. 2018; 131.
Sonah H, O'Donoughue L, Cober E, Rajcan I, Belzile F. Identification of loci governing eight agronomic traits using a GBS-GWAS approach and validation by QTL mapping in soya bean. Plant Biotechnol J. 2015; 13:211-21.
Li X, Li X, Fridman E, Tesso TT, Yu J. Dissecting repulsion linkage in the dwarfing gene Dw3 region for sorghum plant height provides insights into heterosis. Proc Natl Acad Sci U S A. 2015; 112:11823-8.
Scutari M, Mackay I, Balding D. Using Genetic Distance to Infer the Accuracy of Genomic Prediction. PLoS genetics. 2015; 12.
Habier D, Fernando RL, Garrick DJ. Genomic BLUP Decoded: A Look into the Black Box of Genomic Prediction. Genetics. 2013; 194:597-607.
Luan T, Woolliams JA, Odegård J, Dolezal M, Roman-Ponce SI, Bagnato A et al. The importance of identity-by-state information for the accuracy of genomic selection. Genet Sel Evol. 2012; 44:28.
Sun X, Fernando R, Dekkers J. Contributions of linkage disequilibrium and co-segregation information to the accuracy of genomic prediction. Genet Sel Evol. 2016; 48:77.
Ren D, Teng J, Diao S, Lin Q, Li J, Zhang Z. Impact of Marker Pruning Strategies Based on Different Measurements of Marker Distance on Genomic Prediction in Dairy Cattle. Animals (Basel). 2021; 11.
Sun C, Shrivastava A, Singh S, Gupta A: Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. In: 2017 IEEE International Conference on Computer Vision: 22-29 Oct. 2017. 843-52.
Price AL, Weale ME, Patterson N, Myers SR, Need AC, Shianna KV et al. Long-range LD can confound genome scans in admixed populations. American journal of human genetics. 2008; 83:132-39.
Speed D, Cai N, Johnson MR, Nejentsev S, Balding DJ. Reevaluation of SNP heritability in complex human traits. Nature genetics. 2017; 49:986-92.
Fernando B, Gavves E, Oramas MJ, Ghodrati A, Tuytelaars T. Rank Pooling for Action Recognition. IEEE transactions on pattern analysis and machine intelligence. 2017; 39:773-87.


TableS1.doc
Supplementary Material1: Table S1. Effects of augmentation on the GSCNN’s performance.
TableS2.docx
Supplementary Material2: Table S2. The GSCNN’s prediction accuracy and its comparison to baseline methods.
SupplementaryMaterial3.xlsx
Supplementary Material3: Contribution score of each SNP for soybean traits.
SupplementaryMaterial4.xlsx
Supplementary Material4: Contribution score of each SNP for sorghum traits.
