Influence of the PD between the SNPs on the GSCNN’s performance
In the GSCNN, each SNP is represented by two letters: one for the genotype and one for either the PD or the LD with the next SNP. The letter used for the PD or LD depends on the classification method. As presented in Table 1, adding either PD or LD to the encoding improved the GSCNN's average prediction accuracy over the six traits compared to encoding SNPs with genotype only. Encoding SNPs with genotype and narrow-classified PD (classification: 0–10 kb, 10–20 kb, 20–50 kb, and > 50 kb) gave the highest average prediction accuracy for the six traits (0.6496). This is an improvement of 2.28%, 1.88%, 1.26%, and 1.18% over encoding with genotype only, genotype and LD (classification: 0–0.2, 0.2–0.5, 0.5–0.7, and > 0.7), genotype and LD (classification: 0–0.7, 0.7–0.8, 0.8–0.9, and > 0.9), and genotype and wide-classified PD (classification: 0–100 kb, 100–200 kb, 200–500 kb, and > 500 kb), respectively.
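The two-letter encoding can be illustrated with a minimal Python sketch using the narrow PD bins (0–10 kb, 10–20 kb, 20–50 kb, > 50 kb). PD is taken here to be the distance in base pairs to the next SNP on the same chromosome. The letter alphabets, the 0/1/2 genotype coding, the assignment of boundary values to the upper bin, and the class given to the last SNP of a chromosome are all assumptions made for illustration and are not taken from the GSCNN implementation.

```python
from bisect import bisect_right

# Upper edges (in bp) of the narrow PD bins: 0-10 kb, 10-20 kb, 20-50 kb, > 50 kb.
PD_EDGES_BP = (10_000, 20_000, 50_000)
PD_LETTERS = "abcd"                            # hypothetical PD alphabet, one letter per bin
GENOTYPE_LETTERS = {0: "A", 1: "B", 2: "C"}    # hypothetical genotype alphabet


def pd_class(distance_bp):
    """Return the PD letter for a distance to the next SNP.

    Assumption: a distance exactly on an edge (e.g. 10 kb) falls in the upper bin.
    """
    return PD_LETTERS[bisect_right(PD_EDGES_BP, distance_bp)]


def encode_chromosome(genotypes, positions_bp):
    """Encode the SNPs of one chromosome as two-letter tokens (genotype + PD class)."""
    tokens = []
    for i, genotype in enumerate(genotypes):
        if i + 1 < len(positions_bp):
            pd_letter = pd_class(positions_bp[i + 1] - positions_bp[i])
        else:
            # Assumption: the last SNP has no next SNP, so it gets the widest class.
            pd_letter = PD_LETTERS[-1]
        tokens.append(GENOTYPE_LETTERS[genotype] + pd_letter)
    return tokens


tokens = encode_chromosome([0, 1, 2, 0], [1_000, 6_000, 21_000, 100_000])
print(tokens)  # ['Aa', 'Bb', 'Cd', 'Ad']
```

Switching to the wide PD classification or to an LD classification would only change the bin edges and the quantity passed to `pd_class`.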
Table 1
The GSCNN’s prediction accuracy according to different SNP encodings.
SNP encoding | Soybean maturity | Soybean height | Soybean lodging | Soybean protein | Sorghum HT-IL | Sorghum FL-IL | Mean |
I | 0.5927 | 0.5620 | 0.4467 | 0.5491 | 0.8406 | 0.8194 | 0.6351 |
II | 0.5976 | 0.5699 | 0.4507 | 0.5529 | 0.8346 | 0.8196 | 0.6376 |
Ⅲ | 0.6011 | 0.5839 | 0.4483 | 0.5644 | 0.8300 | 0.8212 | 0.6415 |
Ⅳ | 0.5950 | 0.5773 | 0.4500 | 0.5544 | 0.8462 | 0.8289 | 0.6420 |
Ⅴ | 0.6059 | 0.5880 | 0.4575 | 0.5662 | 0.8525 | 0.8274 | 0.6496 |
I, genotype only; II, genotype with LD (LD classification: 0–0.2, 0.2–0.5, 0.5–0.7, and > 0.7); Ⅲ, genotype with LD (LD classification: 0–0.7, 0.7–0.8, 0.8–0.9, and > 0.9); Ⅳ, genotype with PD (PD classification: 0–100 kb, 100–200 kb, 200–500 kb, and > 500 kb); Ⅴ, genotype with PD (PD classification: 0–10 kb, 10–20 kb, 20–50 kb, and > 50 kb).
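The relative improvements quoted for encoding Ⅴ can be recomputed from the Mean column of Table 1. The short Python check below assumes the improvement is defined as the relative change of the mean accuracy with respect to each alternative encoding; the variable names are illustrative.

```python
# Mean prediction accuracy per SNP encoding, copied from Table 1.
mean_accuracy = {"I": 0.6351, "II": 0.6376, "III": 0.6415, "IV": 0.6420, "V": 0.6496}

best = mean_accuracy["V"]
# Relative improvement (%) of encoding V over each other encoding.
improvement = {
    name: round(100 * (best - acc) / acc, 2)
    for name, acc in mean_accuracy.items()
    if name != "V"
}
print(improvement)  # {'I': 2.28, 'II': 1.88, 'III': 1.26, 'IV': 1.18}
```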
Effects of augmentation on the GSCNN’s performance
Using the most effective PD classification according to our results (0–10 kb, 10–20 kb, 20–50 kb, and > 50 kb), we evaluated the impact of different augmentation strategies on the GSCNN. We proposed a novel approach to augmenting biological sequence data in which the order of the chromosomes is shuffled, and we evaluated its effectiveness on the two datasets (Table S1). The relative percentage change in the GSCNN's prediction accuracy under each augmentation strategy, compared to no augmentation, is shown in Fig. 3. With online augmentation, the prediction accuracy decreased for all traits except soybean lodging and protein, and the decline was particularly notable for the sorghum data. Conversely, offline augmentation improved the prediction accuracy for all traits. For soybean maturity, lodging, and protein, the improvement was largest when the training set was augmented five-fold offline (1.55%, 5.55%, and 2.86%, respectively). For soybean height and sorghum HT-IL and FL-IL, the prediction accuracy improved gradually as the offline augmentation fold increased, with the largest improvement at eight-fold (6.4%, 5.54%, and 2.49%, respectively). The average prediction accuracies over the six traits without augmentation, with online augmentation, and with two-fold, five-fold, and eight-fold offline augmentation were 0.6496, 0.6025, 0.6616, 0.6723, and 0.6740, respectively. Therefore, offline augmentation was more effective than online augmentation.
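The chromosome-shuffling augmentation described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: each sample is assumed to be a mapping from chromosome name to its list of SNP tokens, the SNP order within a chromosome is assumed to be preserved, and "k-fold offline" is assumed to mean the original sample plus k − 1 shuffled copies generated before training. The function names are hypothetical.

```python
import random


def shuffle_chromosomes(tokens_by_chrom, rng):
    """Return one augmented sequence: chromosomes in random order,
    SNP order within each chromosome unchanged."""
    order = list(tokens_by_chrom)
    rng.shuffle(order)
    return [tok for chrom in order for tok in tokens_by_chrom[chrom]]


def offline_augment(samples, labels, fold, seed=0):
    """Expand a training set to `fold` times its size before training.

    Assumption: each sample contributes its original sequence plus
    fold - 1 chromosome-shuffled copies, all sharing the same label.
    """
    rng = random.Random(seed)
    aug_x, aug_y = [], []
    for tokens_by_chrom, y in zip(samples, labels):
        # Keep the original chromosome order as the first copy.
        aug_x.append([t for toks in tokens_by_chrom.values() for t in toks])
        aug_y.append(y)
        for _ in range(fold - 1):
            aug_x.append(shuffle_chromosomes(tokens_by_chrom, rng))
            aug_y.append(y)
    return aug_x, aug_y


sample = {"chr1": ["Aa", "Bb"], "chr2": ["Cd"], "chr3": ["Ab", "Ba"]}
aug_x, aug_y = offline_augment([sample], [0.7], fold=5)
print(len(aug_x), aug_x[0])  # 5 ['Aa', 'Bb', 'Cd', 'Ab', 'Ba']
```

Under the same assumptions, an online variant would call `shuffle_chromosomes` each time a sample is drawn during training instead of materializing the copies in advance.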
The GSCNN’s prediction accuracy and its comparison to baseline methods
We evaluated the prediction accuracy (R²) of GBLUP, RKHS, Bayes B, BL, DLGWAS, and the GSCNN (offline eight-fold augmentation) on the two datasets, soybean and sorghum (Fig. 4, Table S2). The GSCNN achieved the highest prediction accuracy for soybean maturity, height, and lodging and for sorghum HT-IL, and was slightly worse than Bayes B and BL for soybean protein and sorghum FL-IL. The GSCNN's average prediction accuracy over the six traits (0.6740) was 16.73%, 16%, 3.10%, 3.47%, and 56.08% higher than that of GBLUP (0.5778), RKHS (0.5779), Bayes B (0.6542), BL (0.6518), and DLGWAS (0.4321), respectively.
Ablation study
These results demonstrate the effectiveness of the GSCNN, which is closely tied to its design. The BERT embedding in the GSCNN can dynamically adjust a word vector according to the context in which the word appears, enabling it to learn higher-quality word embeddings. The attention pooling in the GSCNN enables the derivation of effective representations for complex trait prediction, which requires the analysis of higher-order statistics that may not be easily captured by average pooling or max pooling [30]. We evaluated the performance of the following GSCNN variants: GSCNN-B, which uses ordinary embedding instead of BERT embedding; GSCNN-A, which uses average pooling instead of attention pooling; and GSCNN-M, which uses max pooling instead of attention pooling. For computational efficiency, no data augmentation strategy was used in any of the models. As shown in Table 2, the prediction accuracy of the GSCNN for soybean maturity, height, lodging, and protein and for sorghum HT-IL and FL-IL was 0.6059, 0.5880, 0.4575, 0.5662, 0.8525, and 0.8274, respectively. The GSCNN achieved the highest accuracy for all six traits, indicating the important role of BERT embedding and attention pooling.
Table 2
The prediction accuracy of the GSCNN and its variants.
Model | Soybean maturity | Soybean height | Soybean lodging | Soybean protein | Sorghum HT-IL | Sorghum FL-IL | Mean |
GSCNN-B | 0.6037 | 0.5709 | 0.4550 | 0.5628 | 0.8490 | 0.8265 | 0.6446 |
GSCNN-A | 0.5738 | 0.5576 | 0.4441 | 0.5560 | 0.8449 | 0.8097 | 0.6310 |
GSCNN-M | 0.5917 | 0.5602 | 0.4449 | 0.5626 | 0.8393 | 0.8167 | 0.6359 |
GSCNN | 0.6059 | 0.5880 | 0.4575 | 0.5662 | 0.8525 | 0.8274 | 0.6496 |
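To make the difference between the pooling variants concrete, the NumPy sketch below shows one common form of attention pooling next to average and max pooling. The scoring function (a single tanh projection followed by a dot product with a learned vector), the parameter names `W` and `w`, and the toy sizes are assumptions; the exact attention pooling used in the GSCNN may differ.

```python
import numpy as np


def attention_pool(H, W, w):
    """Attention pooling over a sequence of feature vectors.

    H: (seq_len, d) features, W: (a, d) projection, w: (a,) scoring vector.
    score_t = w . tanh(W h_t); weights = softmax(scores); output = sum_t weights_t * h_t.
    """
    scores = np.tanh(H @ W.T) @ w          # (seq_len,)
    scores = scores - scores.max()         # subtract the max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()      # softmax over sequence positions
    return weights @ H, weights


rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))                # 6 positions, 4 features (toy sizes)
W = rng.normal(size=(3, 4))
w = rng.normal(size=3)

pooled_attention, weights = attention_pool(H, W, w)
pooled_average = H.mean(axis=0)            # average pooling: every position weighted equally
pooled_max = H.max(axis=0)                 # max pooling: per-feature maximum over positions
```

With a zero scoring vector the weights become uniform and attention pooling reduces to average pooling, so in this form it can be read as a learned generalization of the GSCNN-A variant.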
Interpretability of the GSCNN
Explaining a deep learning model’s predictions can be just as critical as prediction accuracy in GS. One of the most popular interpretability approaches is \(\text{Input} \times \text{Gradient}\), which we used to calculate each SNP’s contribution to the prediction (Supplemental file 1, Supplemental file 2) [29]. As depicted in Fig. 5, the majority of the SNPs exhibited similar predictive power, with only a small subset demonstrating greater influence. Therefore, we identified these high-impact SNPs as potential loci for subsequent association analyses.
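The Input × Gradient attribution can be illustrated with a small NumPy sketch. The toy predictor below (a single tanh layer with an analytic gradient) is a stand-in chosen so that the example is self-contained; in practice the gradient would be obtained by automatic differentiation through the trained network. Summing the element-wise product over the embedding dimension to obtain one score per SNP is also an assumption about how per-SNP contributions are aggregated.

```python
import numpy as np


def model(X, W, v):
    """Toy differentiable predictor: f(X) = v . tanh(W @ vec(X))."""
    return float(v @ np.tanh(W @ X.ravel()))


def gradient(X, W, v):
    """Analytic gradient of the toy predictor with respect to the input X."""
    h = np.tanh(W @ X.ravel())
    return (W.T @ (v * (1.0 - h ** 2))).reshape(X.shape)


def input_x_gradient(X, W, v):
    """Per-SNP contribution: element-wise input * gradient, summed over embedding dims."""
    return (X * gradient(X, W, v)).sum(axis=1)


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                # 5 SNPs with 3-dimensional embeddings (toy sizes)
W = rng.normal(size=(4, 15))
v = rng.normal(size=4)

scores = input_x_gradient(X, W, v)
ranking = np.argsort(-np.abs(scores))      # SNP indices ordered by absolute contribution
```

Ranking SNPs by the absolute value of these scores is one way to pick out a high-impact subset for follow-up analysis.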
SNP Gm04_29528926 is near E8, a locus involved in the time to flowering and maturity in soybean [31]. Glyma.02g058500, near SNP Gm02_5360523, is a homolog of INDETERMINATE DOMAIN1/ENHYDROUS, which promotes seed maturation by regulating light GA effects and ABA signaling in Arabidopsis [32]. Substantial evidence supports a strong correlation between plant height and lodging in soybean [33, 34]. Our results agree with this: the Manhattan plots of the SNPs’ contribution scores for plant height and lodging showed similar patterns, with high-contributing SNPs concentrated on chromosomes 13 and 19 and overlapping substantially. The copy numbers of Glyma.13G287600 and Glyma.13G288000, near Gm13_38408846, were negatively correlated with trailing growth and shoot length, which determine plant height and, in turn, affect lodging [35]. Gm08_45695835 is near the region on Gm08 from 45.5 to 46.9 Mb that is significantly associated with soybean protein [36]. In sorghum, S7_58504205, S7_55229509, and S9_57192617 are near Dw3, qHT7.1, and Dw1, respectively, which are significantly associated with plant height [37].