Comprehensive Characterization of Transgene Integration in a Similar Intragenesis Rice Event Using Paired-end Whole-genome Sequencing Approach


 Efficient, accurate molecular characterization of genetically modified (GM) organisms is challenging, especially for novel transgenic products of cisgenesis/intragenesis transferred with genes/elements of recipient species. Herein, GM rice event G281, involving transfer with native promoters and an RNA interference (RNAi) expression cassette in a process similar to intragenesis, was subjected to molecular characterization using paired-end whole genome sequencing (PE-WGS). The results showed that transgenes integrated at rice chromosome 3 locus 16,439,674 included a 36 bp deletion of rice genomic DNA, and the whole integration contained two copies of the complete transfer DNA (T-DNA) in a head-to-head arrangement. No unintended insertion or backbone sequence of the transformed plasmid were observed at the whole genome level. Molecular characterization of the G281 event will assist risk assessment and application for a commercial license. Additionally, the findings demonstrate the applicability of PE-WGS for molecular characterization of cisgenesis/intragenesis crops.


Introduction
The commercialization of genetically modi ed organisms (GMO) has brought great bene ts to farmers, consumers, and the environment (ISAAA, 2019). At the request of all countries and regions, the biosafety of each GMO event should be thoroughly evaluated in terms of molecular characterization, food safety, and environmental safety. Only events that satisfy risk assessment should be approved for production and commercialization (Sparrow PA, 2010).
Molecular characterization of GMOs at the genome level often generates information on exogenous DNA integration and its inheritable stability, and a full understanding of exogenous DNA integration is a fundamental step in molecular characterization (Miraglia, et al., 2004). Exogenous DNA integration in GMOs includes integrated exogenous genes, integration sites, anking DNA sequences of all inserts, the copy number of all inserts, and the presence or absence of unexpected vector backbone sequences (Codex, 2003). This analysis plays a pivotal role in GMO biosafety supervision, quantitative detection, establishment of detection methods, and analysis of the content and traceability of GMO products (Yang et al., 2013). This information is crucial for labeling for assessing the phenotype structural stability, food safety, and commercial value of new GM materials ( Kuiper, et al., 2010;Guttikonda, Marri et al. 2016). In the guidelines for GMO risk assessment, several techniques have been speci ed for probing exogenous DNA integration including PCR-based chromosome walking, Real-time PCR, and Southern blotting, etc. (Li et al., 2017) PCR-based chromosome walking approaches coupled with Sanger sequencing are used for identifying exogenous DNA integration and anking recipient genomic regions, but these techniques also have shortcomings, especially sequence bias and failure to generate comprehensive data, which can lead to missing insertion sites, especially for GM events with multiple exogenous DNA insertions (Li et al., 2017). Real-time PCR and Southern blotting are recommended for determining the copy number of exogenous genes and unintended insertion of backbone sequences of transformed vectors (Li et al., 2017). However, real-time PCR is easy to acquire ambiguous results when studying genes with multiple copies and/or tandem repeats of exogenous genes (Li et al., 2017). Southern blotting requires a large number of samples, takes a long time, requires technically accomplished researchers, and the selection of different enzyme digestion sites can lead to differences in copy number (Li et al., 2017). Molecular characterization of most commercialized GM crop events in the past few decades has employed the above methodologies, but publically available data is not always complete and accurate due to the inherent limitations of the methodologies (Yang et al., 2013).
High-throughput next-generation resequencing (NGS) is increasingly used in many research areas as the cost decreases dramatically. NGS has been successfully coupled with bioinformatics pipelines for full or partial molecular characterization of GMO events based on the resequencing technique. Zhang et al., identi ed integration sites and their anking DNA sequences in transgenic cattle (Zhang et al., 2012). Two transgenic glyphosate-tolerant soybean insertion sites and anking sequences were identi ed using whole-genome sequencing (WGS) (Guo et al., 2016). The precise insertion loci and copy number of transfer DNAs (T-DNAs) in transgenic rice plant lines SNU-Bt9-5, SNU-Bt9-30, and SNU-Bt9-109 were determined using NGS-based molecular characterization methods, and the data were comparable with those obtained from Southern blotting analysis (Doori et al., 2017). Furthermore, corresponding bioinformatics pipelines for analysis of NGS data have been developed (Yang et al., 2013;Zhang et al., 2020). NGS technology combined with target enrichment was used to identify genetic integrations from complex food/feed samples (Zastrow-Hayes et al., 2015;Košir et al., 2018). Previous studies have successfully employed WGS coupled with paired-end sequencing (PE-WGS) to identify transgene insertion sites and their anking sequences. However, molecular characterization using current PE-WGS still remains challenging, especially for transgene insertions with endogenous recipient DNA or DNA sequences that are highly similar to host species (Vanblaere et al., 2014;Li et al., 2017).
As recombinant techniques develop and more gene functions are clearly annotated, novel GM crops are being produced via cisgenesis and intragenesis, along with traditional transgenesis introduction of exogenous genes (Holme et al., 2013). Cisgenesis involves using donor DNA fragment from the species itself or from a crosscompatible species without vector backbone DNA except T-DNA border sequences (Schouten et al., 2008). Intragenesis is more similar to cisgenesis, but intragenesis allows the creation of novel combinations of DNA fragments, such as combining functional genetic elements like promoters and terminators, and antisense or RNA interference (RNAi) cassettes (Schouten et al., 2008). Although cisgenesis and intragenesis mostly involve their own DNA fragments, risk assessment should still be performed, including molecular characterization (Eckerstorfer et al., 2019;Paraskevopoulos et al., 2021). However, reports on molecular characterization of cisgenesis and intragenesis in crops are limited, and there is only one report on deciphering the TDNA insertion of partial cisgenic lines of apple 'Gala' using inverse PCR (Vanblaere et al., 2014).
In order to explore the potential of NGS in molecular characterization of cisgenesis/intragenesis, transgenic rice event G281, generated by transforming the binary vector p1300-S450RNAi-hFL-G6 into rice line Xiushui-110 via Agrobacterium-mediated transformation, was used as an example similar to intragenesis (Lin et al., 2010). We performed molecular characterization of the G281 event using PE-WGS and a dedicated pipeline, and discussed the characterization of cisgene integrations in cisgenesis/intragenesis.

Plant materials
Transgenic rice event G281 is associated with high sensitivity to bentazon, high tolerance to glyphosate, and high expression of human lactoferrin (hLF) protein in seed, according to Zhejiang University, China. T-DNAs including tandem arrays of three expression cassettes of hLF, G6 5-enolpyruvylshikimate-3-phosphate synthase (G6 EPSPS), and the RNAi of rice CYP81A6 gene were introduced into the non-GM line Xiushui-110 via Agrobacterium-mediated transformation (Lin et al., 2010). Seeds of transgenic rice event G281 and its recipient line (Xiushui 110) were kindly supplied by the developer. Rice plants of G281 and its recipient line were grown in a greenhouse in Shanghai Jiao Tong University, China, and fresh leaves were sampled for molecular characterization. Fresh sampled leaves were washed, dried, and ground into powder in liquid nitrogen for DNA extraction. Rice genomic DNA was extracted and puri ed using a Plant Genomic DNA Kit (Cat. no. DP305-3; TIANGEN). The quality and quantity of extracted DNA was evaluated with a Nanodrop ND-8000 spectrophotometer (Thermo Fisher Scienti c, Waltham, MA, USA) and by 1% agarose gel electrophoresis, respectively. Puri ed DNA samples were stored and used for paired-end whole-genome sequencing (PE-WGS), PCR veri cation, and droplet digital PCR (ddPCR) analyses. In PCR and ddPCR analyses, three individual biological replicates were performed.

PE-WGS
Approximately 5 µg of genomic DNA from GM or wild-type (WT) rice lines was used for PE library construction with an Illumina Paired-end Preparation Kit (Illumina Inc, USA) with an average insert size of 500 bp in length. Library quality was evaluated by Pico-Green (Quant-iT; Invitrogen, USA) and an Agilent Bioanalyzer DNA 1000 kit (Agilent Technologies, USA). Constructed libraries were subjected to sequencing on an Illumina HiSeq 2000 DNA sequencer (Illumina Inc, USA) by BGI-Shenzhen (Shenzhen, China), and 100 bp PE reads were generated. Raw sequencing data were preprocessed to remove index sequences, short and low-quality reads, and read lengths were uni ed. The resulting 100 bp trimmed sequencing read data were used for further analysis.

Bioinformatics pipeline for PE-WGS data analysis
PE-WGS data were analyzed using our previously developed bioinformatics pipeline with slight modi cations to add functionality and avoid false-positive reads from host native genome DNAs (Figure 1). To reduce interference from native genome DNAs, false-positive reads were ltered by searching for the positions of similar sequences between plasmids and references using BLASTN (Chen et al., 2015). Candidate results were generated after blocking the positions ± the insert sizes of homologous sequences. Burrows-Wheeler Aligner (BWA version 0.6.2) (Li and Durbin, 2009), SPAdes Genome Assembler (SPAdes version 3.11) (Bankevich et al., 2012), Sequence Alignment/Map tools (SAM tools version 1.6) , and Integrative Genomics Viewer (IGV, version 2.4) (Robinson et al., 2011) formed the pipeline for read mapping, assembly, alignment storage, and visualizing mapping results, respectively. Transformation plasmid sequences of G281 and rice reference gene sequences (RefSeq assembly accession: GCF_001433935.1) were used as reference sequences in the analysis.

Conventional PCR and Sanger sequencing
Primer pairs used for verifying transgene insertion sites and their anking sequences were designed based on the candidate read pairs identi ed in PE-WGS analysis (Supplementary Table S1). Conventional PCR was performed with a Veriti thermocycler (ThermoFisher Scienti c) in a total volume of 50 µl, including 5 µl of 10× Ex Taq Buffer (Takara, Japan), 5 µl of dNTPs (2 mM), 0.25 µl of Ex Taq polymerase (Takara), 2 µl of 10 µM forward primer, 2 µl of 10 µM reverse primer, 2 µl of DNA template, and 33.75 µl of RNase-free and DNase-free water. Each reaction was performed in triplicate, and thermal cycling included 5 min at 98°C, followed by 35 cycles of 30 s denaturation at 94°C, 30 s annealing at 55°C, and 1 min extension at 72°C, then an additional extension step at 72°C for 7 min. Ampli ed products were puri ed, sent to Invitrogen (Shanghai, China) for Sanger sequencing, and the sequencing results were used as query sequences for BLASTN searching (Camacho et al., 2008).

Droplet Digital PCR
The ddPCR primers and probes for G6 EPSPS and hFL genes were designed using beacon designer software version 8.0, and our previously reported primers and probes for rice endogenous reference gene SPS were also used . All primers and probes were purchased from Invitrogen (Shanghai, China) and are listed in Supplementary  Table S2. The ddPCR assays were performed on a QX200 Droplet PCR platform (Bio-Rad, Pleasanton, CA, USA) with a nal volume of 20 µl comprising 10 µl of 2× ddPCR Supermix (Bio-Rad, Pleasanton, CA, USA), 1 µl of 10 µM forward primer, 1 µl of 10 µM reverse primer, 0.5 µl of 10 µM probe, 1 µl of DNA template, and 6.5 µl of RNase-free and DNase-free water. After preparation of the 20 µl reaction mix for ddPCR, droplets were generated in 8-well cartridges using a QX200 droplet generator (Bio-Rad, Pleasanton, CA, USA). Water-in-oil emulsions were transferred to a 96-well plate and ampli ed in a T100 PCR cycler (Bio-Rad, Pleasanton, CA, USA). The procedure for ddPCR assays included 5 min at 95°C, followed by 40 cycles of a two-step thermal pro le comprising 30 s at 95°C and 60 s at 58°C at a ramp rate of 2.0°C/s. After cycling, each sample was incubated at 98°C for 10 min then cooled to 4°C. After PCR ampli cation, PCR plates were transferred to a QX200 droplet reader (Bio-Rad, Pleasanton, CA, USA), and data acquisition and analysis were performed using QuantaSoft (Bio-Rad, Pleasanton, CA, USA). All reactions were repeated in triplicate with three parallels.

Sequencing data from PE-WGS
Transgenic rice G281 (GM) and non-GM rice Xiushui 110 (WT) were sequenced, and 11.7 GB and 12.2 GB data comprising 54,130,964 and 54,819,968 pairs of 100 bp reads were acquired, respectively. General sequencing coverage was up to 28.91´ for GM rice and 29.28´ for WT rice. Additionally, both ends of the 53,545,140 and 54,233,374 read pairs were mapped to the rice reference genome ( Table 1). The calibration weight R with values of 0.986 for both GM and WT was used to modify bias among the mapped sequencing data, which was calculated from the percentage of read pairs that mapped to the reference genome among the total number of trimmed paired reads.

Exogenous DNA integration sites and anking sequences
All trimmed read pairs were analyzed according to the pipelines depicted in Figure 1. For the G281 line, 53,545,140 read pairs and 2604 read pairs mapped to the rice reference genome and the transformed plasmid, respectively. A total of 322 read pairs were extracted for which one end mapped to the rice reference genome and the other mapped to the transformed plasmid using the original TranSeq pipeline without false-positive reads removed. Using the modi ed TranSeq pipeline, 2284 false-positive read pairs were discarded and only 69 read pairs were retained as candidate reads for determining the T-DNA insertion site and its anking sequences (Table 1). Among the 69 potential candidate reads, 62 read pairs mapped to Chr 03, two pairs mapped to Chr 02, and one pair mapped to Chr 04, 06, 08, 11, and 12. After considering the sequencing depth (28.91´) and BLASTN analysis, the 62 read pairs mapped to Chr 03 were con rmed as true candidates ( Table 2). The sporadic read pairs mapped to other chromosomes might be due to data jitter, sequencing errors, and host genome noise. IGV visualization of these candidate read pairs showed that only one transgene integration was introduced at locus 16,439,674 of Chr 03 based on clustering pattern analysis with reference rice (GCA_001433935.1; Figure 2A). The assembled eight contigs (Supplementary Sequence File 1) generated from the 62 read pairs in de novo analysis clearly showed the T-DNA integration site at Chr 03 locus 16,439,674 with a 36 bp rice host genome DNA deletion, and two contigs with lengths of 478 bp and 434 bp comprising the 5' and 3' end anking sequences adjacent to transgene insertion ( Figure 2B). Further conventional PCR results showed that the expected DNA fragments covering the adjacent regions of the inserted site were only observed in G281 rice, while no amplicons were observed in WT rice. Sanger sequencing of the expected amplicons con rmed that the obtained integration site and anking sequence information were identical with those from PE resequencing analysis ( Figure 2C).
The results of WT resequencing data yielded 54,819,968 and 322 read pairs that mapped to the rice reference genome and the transformed plasmid, respectively (Table 1). Only ve read pairs were extracted as candidates in which one end mapped to the rice reference genome and the other mapped to the transformed plasmid, including four reads pairs and one read pair that appeared to match Chr 02 and Chr 07, respectively (Supplementary Table S3).
Further BLASTN analysis of these read pairs showed that all ve were false-positives due to homologous sequences in transformation vectors or sequencing background noise. The WT resequencing results also con rmed its non-GMO authenticity, and the data were helpful for further identifying transgene integration in intragenesis rice line G281.

Copy number of exogenous DNA
To evaluate the copy number of the inserted transgenes based on the resequencing data, the formula was used to calculate the copy number. In the formula, ADT represents the Average sequencing Depth of the Target gene, D represents the general sequencing depth, and R is the relative relationship between mapped reads and the host genome. To improve the accuracy of copy number evaluation and decrease the sequence bias in resequencing, the single copy rice endogenous reference gene, SPS, was used as a calibrator. The calculated copy number for the G6 EPSPS gene, the hLF gene, and the rice Gt1 promoter was 2.11, 1.95, and 3.1, respectively (Table 2), indicating that there were two copies of each transferred into the rice genome.
Furthermore, the G6 EPSPS and hLF gene copy number was also analyzed using the ddPCR method with genespeci c primers and TaqMan probes to compare with the copy number from NGS analysis. The copy number was calculated as the ratio of transgene and rice endogenous reference gene SPS (copy/copy). The ddPCR results showed that the copy number of G6 EPSPS and hLF genes was 1.87 and 1.96, respectively ( Table 2). The results from both resequencing data and ddPCR suggested that the copy number of G6 EPSPS and hLF genes was two, and the copy number from NGS analysis was slightly higher than that from ddPCR. Thus, based on the NGS and ddPCR results, we concluded that two copies of plasmid T-DNAs were inserted into the rice genome.

Arrangement and full sequencing of whole transgene integration
A total of 2604 read pairs in which both ends were well mapped to the transformed plasmid were extracted for rice G281. These extracted reads were used to assemble transgene integration using SPAdes. A total of four contigs were obtained, including one with a length of 9145 bp, while the other three were shorter (279 bp, 266 bp, and 213 bp; Supplementary Sequence File 2). BLASTN results showed that only the 9145 bp contig corresponded to transgene integration, which contained three RNAi expression cassettes; one for CYP81A6 including a 3' untranslated region (UTR), the RNAi sequence, and the CaMV35s promoter; one for G6 EPSPS (Pepc terminator, G6 EPSPS gene, and ubiquitin promoter); and one for hLF (Pepc terminator, hLF gene, and rice glutelin promoter, Gt1) in a tandem arrangement (Figure 3). Based on the copy number evaluation and the single transgene integration site in Chr 03, we concluded that the transgenic T-DNA inserted into rice Chr 03 locus 16,439,674 included two copies of complete T-DNAs. Furthermore, the anking sequences adjacent to the integration site showed that both the 5' and 3' ends of rice genome DNAs were connected to the 3' UTR of the RNAi expression cassette, indicating that the two T-DNAs were arranged head-to-head. The conventional PCR results also con rmed that these two copies of entire T-DNA sequences were inserted in a head-to-head rearrangement. A schematic diagram of the whole DNA insertion is shown in Figure 3.

Unintended integration and plasmid backbone residual insertion
Unintended DNA integration and plasmid backbone residual insertions often occur during transgene transformations, due to the random nature of exogenous DNA integration, especially when using particle bombardment transformation. By mapping the cleaned raw NGS data for both GM and WT rice lines to the transformation plasmid sequence, we found no unintended integration or insertion of plasmid backbone residual sequences in G281, other than the T-DNA region (Figure 4). We also performed conventional PCR analysis targeting the plasmid backbone sequences and did not obtain any positive amplicons from G281 DNA (data not shown), further supporting the conclusion that there were no residual transformation plasmid backbone sequences in G281.

Discussion
Clear molecular characterization of GM events is required to approve GMOs and GM products for commercialization, including new transgenic crops produced using DNA recombinant techniques such as cisgenesis and intrasgenesis (Turnbull et al., 2021). For a comprehensive understanding of transgene integration, NGS has been often accepted and employed to decipher the transgene integration of GMOs. Advantages of NGS include higher e ciency, accuracy, and reliability than with traditional sequencing methods (Arulandhu et al., 2018). However, most published research has focused on transgene integration in conventional transgenic crops/animals, and few studies have explored the possibility of using NGS approaches for molecular characterization of cisgenesis and intragenesis. In the present work, the G281 transgenic rice event was generated by transforming a plasmid containing three gene expression cassettes. In the plasmid, the 5' end of T-DNA is connected to the RNAi sequence of the rice CYP81A6 gene, and its 3' end is connected to the rice Gt1 promoter. According to the de nition of intragenesis, the G281 line can be regarded as having a structure similar to that acquired by intragenesis, although the exogenous G6 EPSPS and hLF genes are derived from other species (Lin et al., 2010). In the present work, we tried to develop a suitable pipeline to characterize transgene integration by intragenesis, employing G281 as an example. We performed successful molecular characterization of G281, including the transgene insertion site and its anking sequences, copy number of transgenes, unintended insertions, and transformed plasmid backbone residues.
The DNAs of both ends of the T-DNA region are from the rice host genome in G281, making it di cult to identify the integration site. Using previously developed pipelines, numerous false-positive read pairs from the rice host genome are extracted, decreasing the e ciency and accuracy of screening and the identi cation of candidate reads. Herein, 322 read pairs were extracted as candidates from G281 data using the TranSeq pipeline. However, only 69 read pairs remained when we used the modi ed pipeline with an additional step restricting the positions ± insert sizes of homologous sequences. According to the ltering of read pairs, the transgene integration site was easily identi ed with high accuracy, whereas more than 75% of extracted reads were false-positives using the original TranSeq pipeline (Yang et al., 2013). These results indicate that the modi ed pipeline avoided the in uence of host genome DNA, hence it could be employed for analysis of new cisgenesis/intragenesis lines.
To determine the sequence and structure of transgene integration, we combined the results of copy number evaluation and de novo analysis, revealing one entire transgene integration at rice Chr 03 involving two repeated T-DNA regions of transformed plasmid, and the two repeats are arranged in a head-to-head manner. In general, the whole transgene integration could be assembled based on reads mapped to the transformed plasmid. However, this is di cult when multiple copies of T-DNAs are inserted. Since the reads from NGS data are very short, identifying DNAs with multiple repeats is challenging (Park et al., 2017). In G281, the whole transgene integration was not obtained directly from de novo analysis. We only assembled one repeat of the T-DNA region with a length of 9145 bp. Based on the sequences of anking sequences and data from PCR ampli cation of the links between two repeats, we con rmed the whole structure and sequence of the transgene integration. The result of copy number evaluation corrected the previous reported result of only one single copy of transgene insertion in G281 line from Southern blotting analysis (Lin et al., 2010). Some recent reports have used third-generation sequencing (TGS) approaches to reveal whole transgene integration (Nicholls et al., 2019;Peng et al., 2021;Chen et al., 2021;Giraldo et al., 2021).
However, analyzing whole transgene integration involving complex structures and/or rearrangements remains di cult using TGS approaches because of the absence of effective data analysis pipelines and quite high costs.
Molecular characterization of the G281 line using the developed pipeline showed that the approach is suitable for investigating intragenesis lines, even though intragenesis is more complex than the G281 event. When exploring intragenesis lines, control lines are more important and should be sequenced at the same time. By comparing NGS data between intragenesis and control lines, false-positive reads can be ltered out using dedicated pipelines. For example, 262 read pairs extracted for the WT line were identi ed as potential candidates for determining the transgene integration site. Obviously, there were no transgenes transferred in the WT line, hence these 266 read pairs must be derived from homogenous sequences between the host genome and the transformed plasmid, sequencing bias, or sequencing data contamination. According to the reads extracted from the WT line, we also found reads that were similar to those in G281, hence the similar reads were false-positives. Therefore, control lines are essential for analysis of intragenesis lines.
In summary, we modi ed our previously established TranSeq bioinformatics pipeline, and performed full molecular characterization of GM rice G281 using a PE-WGS approach with the modi ed pipeline. We identi ed only a single integration site in Chr 03 with two head-to-head repeats of complete T-DNAs inserted in G281. No residual transformation plasmid backbone sequences or unintended integration of exogenous DNA was detected in G281. The results con rm that the PE-WGS and modi ed TranSeq pipeline approach can be successfully applied to characterize transgene integration in intragenesis lines. Figure 1 Modi ed bioinformatics pipeline for molecular characterization of transgenic plants using the PE-WGS approach. Figure 2