Genetically Modied Soybean Lines Exhibit Less Transcriptomic Variation Compared to Natural Narieties

Abstract


Background
Genetically modi ed (GM) crops play essential roles in modern agricultural improvement. The rst GM plants, including antibiotic-resistant petunia and tobacco, were created independently by three research groups in 1983 [1]. Since then, GM crop production has increased dramatically over the past 20 years, growing more than 100-fold from 1.7 million hectares to 180 million hectares globally [2]. In addition to increasing global yield by 22%, GM crops have also reduced pesticide usage by 37% and environmental impact (insecticide and herbicide use) by 18% [3]. Although the rapid development of GM crop production has resulted in signi cant economic bene ts, there has been continued consumer concern that GM products may lead to unforeseen food and environmental safety issues. Remaining questions include the possibility that the insertion of exogenous DNA fragments into crop genomes might lead to the deletion, insertion or rearrangement of genes, thereby affecting biochemical pathways or resulting in the formation of new biological compounds (such as new allergens or toxins). In the risk assessment (RA) of genetically modi ed organisms (GMOs), molecular characterization studies play a crucial role in identifying GM insertions and their target loci. Nucleic acid-level molecular characterization data on GM plants (GMPs) as part of the RA are currently obtained by Southern blot and polymerase chain reaction (PCR) analyses in combination with Sanger sequencing, to determine the precise location of the junction between the transgenic insert and the host genome and to detect the potential presence of backbone sequences from the transformation vector [4]. The development of next-generation sequencing (NGS) technology has introduced an additional or even a replacement tool for the molecular characterization of GMPs [5,6]. In addition to molecular characterization, NGS facilitates transcriptome pro ling [7] and can elucidate the potentially altered expression pro les of genes anking the transgene insert. Recently, highthroughput sequencing combined with "-omics" pro ling technologies, including transcriptomics, proteomics and metabolomics, has been widely used to analyze both target and unexpected effects [8][9][10][11].
Soybean (Glycine max) is an economically important crop and the area planted with transgenic soybeans now constitutes more than 50% of that covered by all transgenic crops. Different exogenous genes that control important agronomic traits, such as abiotic stress [12,13], altered growth/yield [14], and insect resistance [15,16], have been inserted into soybean. Most transgenic modi cations in soybean are related to abiotic stress, such as tolerance to herbicides (e.g., MON87705, MON87708, and MON87701 × MON87705) (http://www.isaaa.org/gmapprovaldatabase/default.asp). Therefore, the safety of GM crops must be evaluated, and substantial equivalence is the cornerstone of these safety assessments. Through the development of new methodologies, indicators of substantial equivalence have become increasingly abundant. Innovative pro ling techniques (such as genomics, transcriptomics, proteomics, and metabolomics) enable comprehensive measurements and comparisons of the transcripts, proteins and metabolites of organisms and provide detailed insights into any unintended effects.
In this study, we performed a transcriptomic sequencing to compare the gene-expression patterns of four tissues from seven different soybean lines (four GM lines, namely, MON87701 × MON89788, MON87708, MON87705 and FG72, and three non-transgenic soybean cultivars, namely, Zhonghuang13, A3525 and JACK), to expand the depth and breadth of our knowledge of GM and non-GM soybean gene-expression pro les. The four primary aims of the study were (1) to determine gene-expression patterns in different tissues in soybean; (2) to compare the obtained expression patterns of natural genotypic soybean lines; (3) to compare differentially expressed genes (DEGs) between GM soybeans and their donor parents; and (4) to determine whether differences in gene expression among natural genotypic soybean lines were greater than those caused by transgenic modi cations.

Results
Statistics of RNA-seq transcript abundance in GM and non-GM plants To identify the molecular events occurring in different tissues of GM and non-GM plants, 28 RNA-seq libraries were constructed using RNA from ten pooled RNA samples. After Illumina sequencing and the removal of adaptors and bad-quality reads, approximately 4,216,226 to 14,932,616 reads were obtained for the 28 libraries (Table 1). Clean reads were then mapped to the soybean reference genome, with the mapped ratio ranging from 43.13% to 89.27%. Among the mapped reads, the frequency of unigenes ranged from 41.21% to 87.95% (Table 1). (PC2) of the total variance (Fig. 1). The two PCs could not separate the GM lines from their non-GM donor parents. For example, MON87705, MON87708 and A3525 were clustered together. The natural soybean lines showed signi cant genetic distance; for instance, Zhonghuang13 was positioned far from the other two natural lines, JACK, and A3525. Notably, the seven datasets from the cotyledon tissues clustered discretely from the other three tissues.

Analysis of differentially expressed genes in the non-GM lines
Gene-expression data were obtained using the gene-expression formula. Genes that were signi cantly differentially expressed between different parental lines were identi ed according to normalized geneexpression levels. Some differentially expressed unigenes were expressed at higher levels in certain lines, whereas others were expressed at lower levels. In total, 7,491 DEGs were detected in the two-group comparisons (non-GM lines/non-GM lines) ( Fig. 2A). When using ZH13 was control, a total of 4,384 DEGs were identi ed between A3525 and ZH13, including 1,888 upregulated genes and 2,496 downregulated genes (Supplementary Table S1), and 6,458 DEGs were identi ed between JACK and ZH13, including 2,298 upregulated genes and 4,160 downregulated genes (Supplementary Table S2). The combined analysis highlighted 3,351 DEGs that were commonly differentially expressed among the three materials, whereas the remaining 3,107 and 1,033 genes were speci cally differentially expressed in the speci c material (Fig. 2B).

Analysis of differentially expressed genes in the non-GM and GM lines
The number of DEGs among the four group comparisons (non-GM lines/GM lines) ranged from 1,836 to 4,551, which represented 15.56% to 38.57% of the total number of genes ( Fig. 2A) Table S6). The combined analysis revealed that 587 DEGs were commonly differentially expressed among the four tissues, and that 382, 1,016 and 2,570 genes were speci cally differentially expressed in speci c comparisons (Fig. 2C).
The numbers of DEGs in each of the two comparisons were then investigated. The total number of DEGs among the three non-GM lines was 7,491 and the total number of DEGs between the GM lines and their donor parents was 6,836. The DEGs were shared by different varieties. The differences in numbers of DEGs among the natural varieties were even larger than those between the GM lines and their donor parents, which is consistent with reports from several previous studies [17,18].
The GO annotations of the two types of comparisons were subsequently compared. The number of GOterm categories (199) for the non-GM lines exceeded the number for the comparison between non-GM and GM lines (166) (Supplementary Table S7). A total of 81 categories appeared in both types of comparisons. The DEGs among the non-GM lines were functionally more varied than the DEGs between the GM lines and their donor parents. These common categories included different biological functions, such as "cellular biosynthetic process", "nuclear transport", and "lipid transport", which are essential for the maintenance of cellular function. Among the three categories of GO terms in non-GM line comparisons, 85 speci c GO items were present, such as "regulation of DNA replication", "regulation of DNA metabolic process", "DNA replication origin binding", and "sequence-speci c DNA binding". This enrichment of DNA-related terms suggests that differences in gene-expression resulting from genetic engineering are primarily associated with changes in DNA-level regulation.

Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis of DEGs
The KEGG pathway database is a knowledge base for the systematic analysis of gene functions in terms of networks of genes and molecules in cells and their variants speci c to particular organisms. To analyze further the DEGs between GM and non-GM plants, all the DEGs were analyzed with respect to the KEGG pathway database.
Among the 7,491 DEGs in the non-GM line comparisons, 126 (28.64%) that signi cantly matched to sequences in the database were assigned to eight KEGG pathways. Among these eight pathways, protein processing in endoplasmic reticulum contained the greatest number of DEGs, followed by the pathways ribosome and ribosome biogenesis in eukaryotes (Fig. 4, Supplementary Table S8). This pattern highlights that active metabolic processes occur in these tissues. Among the 6,836 DEGs between the non-GM and GM lines, the pathway protein processing in the endoplasmic reticulum contained the greatest number of DEGs, followed by ribosome and spliceosome pathways. Comparisons of the KEGG pathways revealed seven signi cant pathways, including photosynthesis, glutathione metabolism, arachidonic acid metabolism, ribosome biogenesis in eukaryotes, ribosome, RNA transport and protein processing in endoplasmic reticulum (Supplementary Table S8). Most of the DEGs between the two comparisons belonged to the same classes and did not differ greatly.

Discussion
In this study, four different tissues, cotyledon (C), germ (G), hypocotyl (H), and radicle (R) of four GMPs and three donor parents were selected to analyze the unintended effects. In previous studies, most studies that have analyzed unintended effects in GM and non-GM lines have performed sequencing and analysis on only one speci c tissue or mixed tissue [18,19]. For example, Liu et al.(2020) used the leave tissues of four GE rice lines expressing Bacillus thuringiensis (Bt) genes that developed by genetic engineering and seven rice lines developed by conventional cross-breeding to discover the unintended effects, and there were only tens of DEGs identi ed. While there were thousands of DEGs identi ed in the current study, which allows more reliable and robust analysis of potential unintended effects.
The PCA analyses of the raw datasets showed a distinct separation between samples with different genetic background, while there was no discrimination between GM and non-GM counterparts in transcriptomic levels. Several previous studies showed the same tendency, such as in wheat and barley [20,21]. For example, a higher similarity between a GE variety and its non-GE near-isogenic line was identi ed than between two common bean varieties in both leaf and grain proteomic pro les in Embrapa 5.1 common bean [8,22]. So, the current nding together with previous results suggested that the intrinsic differences in genetic background bring much greater variation on plant transcriptome than by the introduction of foreign genes by genetic manipulation methods.
Besides the PCA analysis results, the other main result of the study was that signi cant differences existing among natural plant varieties, but the absence of signi cant differences between GM plants and their non-GM donor parents from differentially gene expression analysis. The differences in numbers of DEGs among the natural varieties were even larger than those between the GM lines and their donor parents. Besides the DEG numbers, the GO and KEGG analysis results showed that different types of DEGs exsiting because of gene operation or conventional breeding methods. For subgroup "biological process", the biggest category for non-GM/non-GM line comparisons was "cellular metabolic process", and the category "biosynthetic process" was the largest type for non-GM/GM comparasion. For "molecular function" subgroup, the biggest type of non-GM/non-GM comparisons was "cellular component", and the biggest type for GM/non-GM comparisons was "structural molecular activity"; for "cellular component" subgroup, the biggest type of non-GM/non-GM comparisons was "organelle", and the biggest type for GM/non-GM comparisons was "intracellular". These results mean that although there was no signi cant difference of gene numbers for the two types of comparisons, while the gene types were not same because of the different gene introgression method. While all of the above results were only based on the transcriptomic data, the integrated application of multi-omics approaches, such as proteomics, metabonomics, could be used to evaluate the changes in the plants and their biological relevance.

Conclusion
In this study, four different tissues from seven soybean lines, including four GM lines and their three donor parents, were analyzed for DEGs. After gene expression values were calculated, two types of comparisons were performed: either among the non-GM lines or between GM lines and the non-GM natural varieties. In total, 7,941 DEGs were identi ed among the non-GM lines, and 8,461 DEGs in the comparison between the GM and non-GM lines. The GO and KEGG analyses showed that the categories and biological functions of DEGs among the natural varieties were more varied than those in the GM/non-GM line comparison. We conclude that intrinsic differences in genetic background result in considerably more variation in the plant transcriptome than the introduction of exogenous genes by genetic manipulation methods.

Plant materials
A total of seven soybean lines, including three lines obtained by conventional breeding and four lines developed by genetic engineering (GE) that express EPSPS genes, were used in this study (Fig. 5)

RNA-seq library sequencing and mapping
The transcriptome library was sequenced using the Illumina HiSeq 2000 system. After the raw data were generated and the data-processing steps were completed, the clean reads were then mapped to the soybean reference sequences using RSEM software [23]. Mismatches of no more than two bases were allowed in the alignments. The read count for each gene was obtained from the mapping results. The normalized data were fed to SIMCA 14.1 software (Umetrics, Umea, Sweden) for principal component analysis (PCA).

Differentially expressed gene identi cation
Gene-expression levels were calculated based on the numbers of reads mapped to the reference sequence using the FPKM [24] method. The differentially expressed genes (DEGs) were then screened by comparing gene-expression levels. Differential expression analysis of each comparison was performed using the DESeq R package (1.10.1) using the method described by Anders [25]. DESeq provides statistical routines for determining differential expression in digital gene-expression data using a model based on the negative binomial distribution. The resulting P values were adjusted using Benjamini and Hochberg's approach to control the false discovery rate. In this study, unigenes with an adjusted P < 0.05 identi ed by DESeq were considered to be differentially expressed.