Subtelomeric 5-enolpyruvylshikimate-3-phosphate synthase (EPSPS) copy number variation confers glyphosate resistance in Eleusine indica

Genomic structural variation (SV) can have profound effects on an organism’s evolution, often serving as a novel source of genetic variation. Gene copy number variation (CNV), a specific form of SV, has repeatedly been associated with adaptive evolution in eukaryotes, especially to biotic and abiotic stresses. Resistance to the most widely used herbicide, glyphosate, has evolved through target-site CNV in many weedy plant species, including the economically important cosmopolitan grass, Eleusine indica (goosegrass); however, the origin and mechanisms of these resistance CNVs remain elusive in many weed species due to limited genetic and genomics resources. In order to study the target site CNV in goosegrass, we generated high-quality reference genomes for both glyphosate-susceptible and -resistant individuals, fine assembled the duplication of glyphosate’s target site gene enolpyruvylshikimate-3-phosphate synthase (EPSPS), and revealed a novel rearrangement of EPSPS into the subtelomeric region of the chromosomes, ultimately leading to herbicide resistance evolution. This discovery adds to the limited knowledge of the importance of subtelomeres as rearrangement hotspots and novel variation generators as well as provides an example of yet another unique pathway for the formation of CNVs in plants.

Some regions of the genome, as well as some gene families, are especially prone to generating SVs 51 and CNVs. Chief among these are regions of highly repetitive sequences where unequal crossing over 52 happens frequently due to misalignment of sister chromatids, homologous chromosomes, and even non-53 homologous chromosomes. Chromosomes are often highly repetitive at their ends in the telomeres and 54 subtelomeres. The subtelomeres are loosely defined and vary across taxa but are typically the next 50kb to 55 100kb of genome adjacent to the telomeres. Subtelomeres, while generally gene poor, can be sources of 56 novel CNV events as crossing over can happen frequently. For instance, it has been previously shown in 57 Phaseolus vulgaris that certain pathogen resistance genes exist in or near the subtelomere and that due to 58 their proximity to the subtelomeres, these resistance genes are highly duplicated leading to certain pathogen 59 resistance phenotypes 9 . This phenomenon is also relevant in monocotyledonous species. In allohexaploid 60 bread wheat (Triticum aestivum), there is generally less synteny of genic regions among subgenomes in the 61 subtelomeres compared to interstitial regions between the centromeres and subtelomeres, partially from 62 high levels of CNV 10 . CNVs in subtelomeres are not limited to plants; in human subtelomeres, CNVs 63 constitute around 80% of the most distal 100kb of the chromosomes 11 . 64 In plants, we are only now beginning to understand the importance of SV as a novel source of 65 genetic variation due to the advent of ubiquitous and inexpensive genome sequencing technologies 12 . 66 Projects like the 1001 Genomes Project 13 are thoroughly investigating SV and its importance in Arabidopsis 67 thaliana. Additionally, there have been other examples demonstrating the importance of SVs in the 68 evolution of crops and non-crops and the massive effects that they can have on phenotypes 6,14-16 . One of 69 the most striking examples of SV in action has been the evolution of glyphosate (Roundup) resistance in 70 weedy species that effect row crop production. In cases where CNVs cause glyphosate resistance, the plant 71 over-produces the EPSPS enzyme, to a degree that an enormous amount of glyphosate is needed to inhibit 72 the entire EPSPS protein pool. 73 At least nine divergent, monocot and eudicot weed species have independently evolved glyphosate 74 resistance via EPSPS CNV, an astounding example of convergent evolution 17 . Furthermore, each species 75 has evolved these CNVs uniquely. The first EPSPS CNV was discovered in Amaranthus palmeri (palmer 76 amaranth) 18 . It was eventually discovered that the EPSPS gene is being duplicated by a novel, extra-77 chromosomal, circular piece of DNA named "the replicon" 19,20 . The replicon independently replicates from 78 palmer's core genome and tethers itself to the core genome during cellular division 20,21 . Other weed species 79 have duplicated EPSPS in more familiar ways; for example, EPSPS is duplicated in Bassia scoparia 80 (kochia) in tandem and is thought to be the result of a combination of transposable element activity and 81 unequal crossing over 22 . Efforts to identify mechanisms of CNV formation have been concentrated on dicot 82 species in the Americas, although globally, weedy monocot species are more problematic. 83 Despite E. indica's global economic importance, molecular biology and genomics research has 84 remained difficult due to the lack of a quality reference genome and other molecular tools 23,24 . In this work, 85 we generated a nearly end-to-end assembly of a glyphosate susceptible (GS) goosegrass individual and a 86 near complete assembly of a glyphosate resistant (GR) individual. Furthermore, we use genomic 87 resequencing, comparative genomics, and transcriptomics to identify the complete genomic region involved 88 in goosegrasses EPSPS CNV event and provide insight in the mechanism driving increased copy numbers. 89 This is the first in-depth investigation into the nature of EPSPS CNV in a grass species and is the first 90 description of subtelomeric repeat driven herbicide resistance in any species. On average, gene density and transposable element density vary inversely in both GR and GS 106 genomes; gene density decreases near the centromeres but increases in the distal parts of the chromosome 107 arms, with the opposite being true for transposable element density (Fig. 1). There are higher numbers of 108 LTR transposons (i.e. Copia, Gypsy, and all other TEs), as well as other transposable elements, clustering 109 at the centromeres and the most distal arms of each chromosome near the telomeres. The Gypsy superfamily 110 is especially prevalent, having the highest transposable element density (Fig. 1) two, three, and four have these tandem telomeric repeats at one terminal end and the remaining 121 chromosomes, one, five, six, seven, eight, and nine do not contain any terminal tandem telomeric repeats 122 (Fig. 2). In the GR genome, chromosomes one, two, four, seven and eight begin and end with tandem 123 telomeric repeats, while chromosomes three, five, six, and nine only have tandem telomeric repeats at one 124 end of the chromosome (Fig. 2). This indicates we have captured most but not all the full-length 125 chromosomes in the GR genome. The increase in telomere-to-telomere coverage in the GR genome 126 compared to the GS may be explained by biological factors, such as increased homozygosity of the 127 resistant line, or computational factors, such as different amounts of input PacBio or PacBio read N50 128 size. Most likely the highly repetitive chromosome ends and redundancy between the telomeric repeats 129 makes further refinement difficult without the implementation of novel techniques specifically designed 130 to resolve these regions.

131
The GS and GR assemblies are highly syntenic, as indicated by the linear arrangement of large 132 (>10kb) lengths of nearly identical sequences that remain in order over entire chromosomes (Fig. 2). Due 133 to the high amount of synteny between the assemblies, the slightly more fragmented GR genome has been 134 manually ordered and named to maximize alignment with the GS genome. In the GR assembly, 135 chromosomes three and five are composed of two contigs each while chromosomes six, and seven are 136 comprised of three contigs, with the remaining chromosomes being in single contigs. Two large inversions 137 were assembled in the GR genome at the ends of chromosomes five and six; and without Hi-C or optical 138 mapping, we cannot say for certain how supported these inversions are or if they have any impact on the 139 resistance phenotype. 140 In both the GS and GR genomes, EPSPS, glyphosate's target, was located approximately 1.6Mb 141 from the telomere of chromosome three. In the GR genome, EPSPS was also identified in 23 short 142 unscaffolded contigs, but always co-assembled in these un-scaffolded contigs with a copy of a sequence 143 normally 2.4Mb from the telomere of chromosome three, about 1Mb downstream of EPSPS. We call the 144 original location of EPSPS in the genome "Region-A," and the region co-assembled with Region-A, 145 "Region-B" for ease of discussion and clarity ( Fig. 1; Fig. 3 Fig. 3). Seven of the eight GR individuals had EPSPS read 160 depths of approximately 25-29x above background except for sample R14 which only had 8x (Fig. 3). From 161 this observation we believe this sample was a heterozygote, while the other seven GR individuals had copies 162 in both the paternal and maternal genome. Additionally, these results indicate both that Region-A and 163 Region-B are co-duplicated in every copy of the EPSPS CNV, and that CNV and the other containing the TIPS mutation. Possibly associated with the TIPS mutation, a 176 synonymous alanine substitution (GCA to GCG) was also found 28 amino acid residues upstream from 177 T102I on all reads containing the TIPS mutations in the same 12% read coverage. The co-occurrence of 178 TIPS and EPSPS CNV in this individual is not unexpected because it has been shown previously that E. 179 indica individuals can have stacked resistance mechanisms and fitness penalties associated with mutations 180 can be compensated for with unmutated copies of EPSPS. Such scenarios were predicted when the TIPS 181 double mutation was first identified 2 and later discovered in goosegrass 4 . Furthermore, it indicated that 182 SNPs should also be carefully looked for in GR weeds with low EPSPS CNV frequency as a potential 183 primary source of glyphosate resistance. 184 In GR individuals, Region-A and Region-B are fused and associated with a 3,396bp region of 185 unknown origin inserted at the beginning of Region-B, labeled 'Region-I' (Figures 4a and 4b). Region-I 186 contains no predicted genes or features of significance we can decern. Together, Region-A, Region-B, and 187 Region-I makeup the entire 'EPSPS-Cassette', a structure not found in any of the glyphosate-susceptible 188 individuals (Fig. 3a). Flanking the EPSPS-Cassette on both sides is a 452bp subtelomeric sequence that can 189 also be found in the most distal part on various chromosomes in the susceptible and resistant E. indica 190 genomes as well as to the subtelomere from other grass species. On one side of the cassette the subtelomeric 191 sequence is repeated 12 times in the forward and 31 times in the reverse directions and serves to invert the 192 EPSPS cassette. On the other side is a much larger array of repeats consisting of a minimum of 43 repeats 193 in the forward and 294 repeats in the reverse directions (Fig. 3a). We were able to span and assemble across 194 the small array of repeats; however, the larger array became ambiguous. Given previous fluorescent in situ 195 hybridization (FISH) experiments EPSPS exists on two chromosomes in GR E. indica 27 . This information 196 indicates that the EPSPS-Cassette is not scattered or widely dispersed but located in tandem on the ends of 197 one or two chromosomes.

198
To verify the EPSPS-Cassette model presented here and specifically confirm each region-junction 199 PacBio resequencing data were aligned to the EPSPS-Cassette (explicitly junctions checked are: and forward EPSPS-Cassettes (Fig. 3b). The number of reads supporting the inversion is almost exactly 204 half of the number of reads that support all other junctions, indicating that the inversion exists half as much 205 as the others. That is to say, the full-length cassette consists of one forward and one reverse copy of the A-206 B fusion, joined by the inverted, shorter subtelomeric repeat and flanked by much larger subtelomeric 207 sequences on both sides. 208 In addition, an RNA-seq experiment was performed to investigate gene expression changes driven 209 by the EPSPS CNV. Although the entire EPSPS-Cassette and all genes within it are co-duplicated in GR 210 individuals, four of the five genes in Region-A are significantly overexpressed (p-value<0.01 and fold-211 change>2), while only one out of four genes in Region-B is significantly overexpressed (Table 2, Fig. 5). 212 Genes overexpressed other than EPSPS from Region-A include A410: A ribosomal subunit protein, A390: 213 a tRNA-2'phosphotransferase, and A440: a protein of unknown function. Interestingly, homologs of A410 214 and A390 are also co-duplicated with EPSPS in Bassia scoparia, a eudicot weed with a tandemly duplicated 215 EPSPS CNV 22 . Given the annotation of these proteins, it is unlikely they are directly involved in the EPSPS 216 CNV formation. B510 is the only significantly overexpressed (log fold-change: 5.2, p-value: 6.9e-11) gene 217 in Region-B ( Fig. 7; Table 2). Gene B510 encodes a RadA-like protein, a type of protein known to be 218 associated with EPSPS in glyphosate resistance in other prominent weed species such as kochia. This gene 219 involvement in the formation of the EPSPS CNV is unknown; however, its overexpression indicates it is 220 currently active. RadA proteins are DNA-dependent ATPase that process DNA recombination 221 intermediates and are therefore involved in repairing DNA breaks 28 . These proteins are particularly 222 interesting in the case of EPSPS duplication due to their role in catalyzing homologous recombination. 223 Whether or not RadA is directly involved in the duplication of the EPSPS loci from various weed species 224 or merely coincidental is open for examination. 225 226

Subtelomeres in Eleusine indica 227
Whole genome alignment reveals highly conserved subtelomeric repeat sequence from the 228 EPSPS-Cassette near the ends of many of the assembled chromosomes in both the GS and GR goosegrass 229 genomes (Fig. 2). Interestingly, over twice as many subtelomeres at higher average copy number in each 230 region was assembled in the GR genome (Fig. 2). There is an especially high copy number of 231 subtelomeric repeats in chromosomes one, three, four, and seven of the GR genome that could be sites of 232 meiotic recombination and subtelomere rearrangement in and between chromosomes (Fig. 2) 29 . It has 233 been previously shown in Phaseolus vulgaris that similar subtelomeric sequences are predisposed to 234 unequal intra-strand homologous recombination 30 commonly resulting in large duplications of whole 235 pathogen resistance gene pathways. 236 The 452bp-long subtelomeric repeat unit that flanks the EPSPS-Cassette is most like the 237 subtelomeric repeat region of second contig that comprises chromosome three from the GR genome 238 (99.556%), indicating chromosome three is likely the location of the EPSPS-Cassette in GR plants.

239
EPSPS is natively on chromosome three in both assembled genomes. The subtelomeric sequence of 240 chromosome four in both the GS and GR genomes are also highly similar to the EPSPS-Cassette genomes are identical (Fig. 6). Work by other researchers using FISH cytometry have shown that the 246 EPSPS CNVs in goosegrass on one or possibly two chromosomes 27 . Given the sequence similarity of the 247 subtelomeric repeat in the EPSPS-Cassette and the subtelomeres on chromosomes three and four, 248 translocation of the EPSPS-Cassette between these two regions through non-homologous recombination 249 seems feasible; however, we assembled chromosome four of the GR genome from telomere to telomere, 250 completely through the subtelomeric region, and found no evidence of the EPSPS-Cassette on 251 chromosome four in this population. 252 253

Subtelomeres in plant evolution 254
Subtelomeres of eukaryotic organisms are hotspots for adaptive evolution due to frequent, error-255 prone recombination events during meiosis that lead to rapidly changing genes. In common bean (Phaseolus 256 vulgaris), segmental duplications, sometimes up to 100kb-long, of disease resistance genes located in the 257 subtelomeres are facilitated by non-homologous end joining and frequent interchromosomal 258 recombination 30 . The large duplications of these distal disease resistance genes allow for divergence of 259 homologous genes into paralogs to allow novel resistance that accounts for the rapid evolution of pathogens. 260 Analogous to the generation of novel disease resistance genes in the subtelomeres of common bean, CNV 261 increases variation and diversity of virulence genes in the subtelomeres of several eukaryotic pathogens 262 including genetic variation and our findings, we have strong evidence that E. Indica EPSPS copy number variation is 280 due to unequal crossing over in the subtelomeres of this species after an initial translocation event. 281 282

Conclusion 283
The large population size, plastic genomes, and large selection pressures exerted on weed species 284 makes weeds an excellent system to study adaptive genome evolution. As we explore these genomes, we 285 continuously find new ways that plant genomes innovate and overcome various stresses and discover 286 exactly what makes weedy species like goosegrass such a successful survivor. The goals of this research 287 were to both obtain high quality genomic resources for E. indica, a major, global weed, and to use those 288 resources to investigate the genomic rearrangements and mechanism(s) that perpetrated EPSPS gene 289 duplication and therefore glyphosate resistance in this species. By assembling both a GS and GR E. indica 290 genome, genomic resequencing eight individuals of both populations, performing RNA-seq on eight 291 individuals from each population, and manually curating the assembly surrounding the EPSPS locus, we 292 have discovered that the EPSPS gene in GR E. indica has been duplicated, fused with another part of the 293 genome, and inserted in one or more of the subtelomeric regions of the genome. We hypothesize that after 294 this initial translocation and fusion, EPSPS duplication has carried on through unequal crossing over of the 295 subtelomeres, facilitated by the high amounts of similarity between the subtelomeric sequence on the ends 296 of chromosomes three. This is the first report of an herbicide resistance trait brought about by this type of 297 genomic rearrangement and adds to both our knowledge of herbicide resistance evolution, and to the 298 relatively limited information we have about the importance of subtelomeres as rearrangement hotspots and 299 novel variation generators. 300 301 4. Methods 302

Glyphosate-susceptible (GS) and glyphosate-resistant (GR) populations of E. indica that have been 304
characterized in previous studies collected from Guangdong Province, China 4 . These populations were 305 purified and self-pollinated for increased homozygosity and consistency of glyphosate susceptibility or 306 resistance phenotypes. The GR population was confirmed to have EPSPS copy number variation by DNA 307 quantitative PCR before purification. 308 Seeds of purified GS and GR biotypes were sown on wet filter paper in Petri dishes in a climate 309 chamber at 28-30 °C, with 12h/12h light/dark period and 70% relative humidity. The two-leaf stage 310 seedlings were transplanted into 28 × 54 cm trays (50 plants per tray) filled with potting soil and grown in 311 a glasshouse. At the tillering stage, about ten individuals were randomly selected each from the GS and GR 312 E. indica population and characterized. Three tillers of each plant were separated and repotted (one tiller 313 per pot, 60 pots in total). One tiller of each plant was used for glyphosate resistance and susceptibility 314 phenotyping, one for EPSPS CNV estimation and one for subsequent sequencing. 315 For glyphosate resistance and susceptibility phenotyping, one regrowth tiller (three days after tiller 316 cloning) was treated with commercial glyphosate (41% glyphosate isopropylamine salt, 400 g ai ha-1 for 317 GS and 1600 g ai ha-1 for GR), and GR (i.e., survivors) and GS (i.e., killed) phenotypes were determined 318 three weeks after treatment. EPSPS CNV was again assessed in the resistant population to ensure the CNV 319 event was still present before genomics work began. (version 2.30.0) as a measure of data reduction before further annotation. Isoseq reads were then mapped 374 to both repeat-masked genomes using Minimap2 50 (version 2.24) to determine sites of transcription. The 375 resulting Sequence Alignment Map (SAM) files were converted into Binary Alignment Map (BAM) files 376 using the SAMtools 51 (version 1.11) view command before being collapsed using cDNA Cupcake 52 377 (version 28.0). The genomes, collapsed cDNA Cupcake outputs, repeat libraries from RepeatModeler, and 378 a protein FASTA file from a close relative, Eleusine corocana (Phytozome genome ID: 560), were fed into 379 MAKER 53 (version) to predict the genomic coordinates of putative gene models. Genes that produced 380 proteins under 32 amino acids long were removed from further annotation with only the longest proteins 381 from each gene and unique untranslated regions (UTRs) used for functional annotation.

Genome-resequencing and transcriptomics 390
Illumina 150bp paired-end sequences were generated from the DNA of eight GS and eight GR 391 individuals from the above-named populations. DNA was extracted using leaf material from untreated 392 tillers of corresponding GS and GR plants using the Plant Genomic DNA kit (TIANGEN, Beijing, China).

393
DNA was sequenced using the Illumina HiSeq X Ten sequencing platform (Illumina Inc., San Diego, CA, 394 USA) with an average of 30x coverage. Illumina reads were cleaned using FASTQ and aligned to the 395 susceptible genome using HiSat2 61 (version 2.1.0) using standard options for paired-end reads. CNVnator 62 396 (version 0.4.1) was used to scan the read depth from the alignments in 5kb windows to roughly call regions 397 of the genome that deviated significantly from the average read depth. CNVnator outputs were intersected 398 using bedtools 49 (version 2.30.0) intersect so that regions that were amplified in all eight GR individuals 399 but none of the GS individuals were identified. Only two such regions were discovered. The region 400 containing EPSPS on chromosome three was labeled "Region-A", while the other, was labeled as "Region-401 B" for further analysis.

402
Illumina 150bp paired-end sequences were generated from the RNA of the same eight GS and eight 403 GR individuals that were used for genome resequencing. RNA was extracted from the leave sheath material 404 using the mirVana miRNA Isolation Kit (Ambion). Illumina reads were cleaned and aligned to the GS 405 genome predicted transcriptome using HiSat2 (version 2.1.0) using standard options for paired-end reads. 406 Alignments were converted into count tables using SAMtools (version 1.11). The count table was loaded  407 into R (version 4.2.0) and differential gene expression was calculated using package edgeR 63 (version 408 3.38.1) quasi-likelihood F-tests. 409 410 4.6 Investigation of the EPSPS-Cassette and subtelomeres 411 The EPSPS-Cassette model was resolved by first using BLAST to identify all contigs containing 412 EPSPS, Region-A, and Region-B. Contigs of interest were self-aligned in YASS 64 to visualize 413 macrostructure, especially repeat structure. Contigs were manually assembled into putative models based 414 on their repeat macrostructure. Contig junctions of the putative models were confirmed by aligning genomic 415 reads to them using HiSat2 (version 2.1.0). The large subtelomeric repeat regions flanking the reverse-416 forward EPSPS-Cassette duplications were not able to be assembled completely using this method. The 417 locations and relatedness of sequences similar to the 452bp subtelomeric repeat unit were found using 418 Minimap2 (version 2.24) and BLAST. 419 420 4.7 Plot generation 421 Circos 65 (version 0.69-9) was used to visually summarize the overall E. indica genome (Fig. 1).

422
Coverage windows used to make the Circos tracks were generated using bedtools (version 2.30.0; Fig. 1 Fig. 5). The EPSPS Cassette was visualized using YASS ( Fig. 4a & Fig. 4b). Differential expression 427 between eight GS and eight GR E. indica individuals was visualized using ggplot2 67 (version 3.4.0) in R 428 (version 3.6.0). The subtelomere relatedness tree was generated using RAXML-NG 68 (version 1.   pairs. Red links indicate large inversions of synteny between the genomes. Black links represent Region-A and Region-B of the EPSPS-cassette in 617 their native locations. Numbers in boxes above and below the ideogram indicate the number of copies of sub-telomeric repeats at each locus. A bold 618 letter "T" on the karyotype represents ends of the chromosomes where the assembly reaches the telomeres. 619 620 621 from RNA-Seq data shows over-expressed (red) and under-expressed (blue) genes in GR E. indica 648 individuals with labels for all identified genes within the EPSPS cassette. Gene labels with a non-integer 649 numerical value represent splice variants of the same gene. Genes below a p-value of 0.01 or a fold 650 change value below two were considered not differentially expressed (grey) between the two treatment 651 groups. blue) E. indica genomes to the subtelomeric sequence found on the EPSPS-Cassette (green). 658 Chromosomes at branch tips further from Cassette are less related to Cassette than chromsomes closer to 659 Cassette. Branch distance is based on BLAST similarity. The sequence with the highest relatedness to the 660 EPSPS-Cassette subtelomere sequence on each chromsome were used as representative sequences to 661 make a tree. 662