Unexpected Micro-Spatial Scale Genomic Diversity of the Bloom-Forming Cyanobacterium Aphanizomenon Gracile and its Phycosphere

Background: Genotypic diversity within cyanobacteria populations, partly driven by horizontal gene transfers, is a key factor of their colonization success, allowing them to cope with spatio-temporal fluctuations of the environmental conditions. By providing complementary functions, such as oxidative stress protection, heterotrophic bacteria composing phycospheres play also an essential role in cyanobacteria adaptation. Aphanizomenon gracile is a species of toxinogen cyanobacteria blooming worldwide with severe consequences for fresh and brackish water ecosystems. While marker heterogeneity surveys have shown that harmful cyanobacteria blooms were not clonal populations, the real extent of genetic and functional diversity within a population of freshwater cyanobacteria, including A. gracile , and their associated phycospheres remains unclear. Results: Here, comparative omics of four monoclonal strains of A. gracile isolated from a single drop of water reveals extensive heterogeneity of chemotypes and gene contents, despite constrained genome size and high similarity indices. These variations are remarkably associated with horizontal gene transfers (HGT) of biosynthetic gene clusters (BCG), and a novel siphophage infecting A. gracile displaying characteristics of temperate phages appears to participate to this genotypic diversification. In spite of high variability in heterotrophic taxa relative abundances, A. gracile phycospheres displayed an apparent functional redundancy implying biosynthesis saccharides and carotenoids. It also reveals the production of various yet uncharacterized variants of puwainaphycins, in the three strains that specifically present puwainaphycin gene cluster. In total, in the 1400 detected analytes only 64 (4.57%) were synthesized in all cultures. The great majority of them (75.14%) were specific to a single strain, illustrating the large molecular diversity retrieved in the metabolome of the four strains beyond their genomic similarities and differences. Aphanizomenon gracile by comparing strains isolated from a single drop of freshwater during a bloom event, and the taxonomic and functional diversity of associated cyanospheres. We present the first sequenced genomes of Aphanizomenon gracile , and we show that strains isolated from a single drop of water display unexpectedly high genetic diversity, despite a constrained genome size (5.33±0.06 Mb) and the absence of self-evident differentiation at the small subunit 16S rRNA gene sequence and ANI levels 48 .

cyanobacteria individuals to cooperate together and with heterotrophic bacteria in a black queen hypothesis compatible way.

Background
Aphanizomenon is a genus of filamentous harmful bloom-forming cyanobacteria (cyanoHAB) occurring worldwide in brackish and freshwater ecosystems 1 . Aphanizomenon can dominate phytoplanktonic communities in an exclusive way 2 , leading to blooms with multiple deleterious effects, including trophic network perturbations, microbial loop alterations, and release of a particularly broad range of toxic metabolites (e.g. microcystins, saxitoxins, cylindrospermopsins) 1,3 .
Furthermore, its ability to fix inorganic nitrogen makes Aphanizomenon responsible for the subsequent growth of other non-nitrogen-fixing cyanoHAB such as Microcystis 4 . Despite the ubiquity and the high number of described strains of Aphanizomenon, little is known about the genetic diversity of the genus and genomic data are only available for few strains belonging to the species A.
The ecological success of cyanobacteria partly relies on the genotypic diversity existing within a single population [5][6][7] , allowing populations to cope with spatial and temporal environmental variations. In freshwater ecosystems, these population studies remain mainly focused on 16S rRNAencoding gene fragment or internal transcribed spacer (ITS) sequencing and the presence/absence of cyanotoxin-biosynthesis genes in cyanobacterial strains from eutrophic lakes (e.g. Microcystis aeruginosa, Planktothrix agardhii, Raphidiopsis (anc. Cylindrospermopsis) raciborskii) 5,8,9 because of their environmental and health impacts. However, infra-specific genetic diversity in many freshwater cyanobacteria is obviously not restricted to these markers, and whole genome sequencing highlighted that populations often consist of diverse conspecific strain that display the same morphotype, yet sometimes only share a limited fraction of their genes, as low as 54% for example for some clones of M. aeruginosa 10 . Each strain thus contains only a subsample of the global population-level pangenome. The same can be observed for the metabolites, which display a high strain-specific profile, extending far beyond more cyanotoxin production capabilities 11 . Such infraspecific diversity is suspected to result in increased capabilities to cope with different selective pressures such as temperature, light intensity or nutrients availability, and improved exploitation of the water column and environmental micro-niches. However, little data is currently available documenting the extent of genome and chemotype diversity at a population level. Other key players that participate in cyanobacteria population regulation have been overlooked. Albeit known for a long time, the beneficial interactions established between cyanobacteria and associated heterotrophic bacteria 12 composing the so-called phycosphere (or cyanosphere) have recently enjoyed renewed interest 9,13,14 . These heterotrophs have been shown to exploit metabolites released by cyanobacteria, which on their side benefit from reduced concentrations of reactive oxygen species in vitro, otherwise deleterious for photosynthesis, thanks to cyanosphere-produced peroxyredoxins and catalases 13,15 . The broad range of metabolic capabilities carried out by cyanobacteria and the distantly-related bacterial phyla composing the cyanosphere may also promote a higher diversity of interactions. Another important yet neglected group are the cyanophages. Indeed, temperate cyanophages, mostly studied in marine environments 16 , are thought to drive the genomic plasticity associated with genotypic diversity of cyanobacteria, and viral lysis to trigger bloom senescence 17 .
In this study, we evaluate how representative one single individual can be of the potential pangenome of a blooming population of A. gracile, and attempt to decipher the interactions that may contribute to the adaptative success of this cyanobacterial species. For this, the inter-individual genomic and metabolite diversity of A. gracile are described from four monoclonal strains isolated from a single drop of water collected in an eutrophic freshwater peri-urban lake. Using Metagenome-Assembled Genomes (MAG) approach, we also identify members and features of the microbial community composing the A. gracile cyanosphere, which include heterotrophic bacteria and a novel cyanophage that may be involved in A. gracile population dynamics.

Site and sampling conditions
The lake of Champs-sur-Marne (48°51'50" N, 2°35'53" E) is located in the outer eastern suburb of Paris (France). This reservoir is a small (10.3 ha) shallow (mean depth of 2.4 m) water body isolated from the river system 18 , used for recreational activities, and surveyed weekly for cyanobacteria and toxins presences since 2005. Raw water was collected from the sub-surface level in July 2010 and brought back refrigerated to the lab.

Cyanobacterial isolation and culture conditions
A. gracile strains isolations from one droplet were conducted by repeated manual transfers of single filaments on solid or liquid Z8 media (with or without nitrogen) 19 , at least 3 times, under an inverted microscope (Nikon ECLIPSE TS100). Growing clones were later cultured in 25 cm 2 culture flasks (Nunc, Roskilde, Denmark) containing 10 mL of Z8 medium 19

Metagenome assemblies, annotation and curation
Scaffolds were assembled from MiSeq and Pacbio reads using SPAdes-based Unicycler hybridassembler, with default parameters, for the metagenome of each individual strain culture 21,22 . For each metagenome, nodes from assembly graphs were clustered using MyCC (k-mer size=4, minimal sequence size=1000) 23 and taxonomically annotated using Contig Annotation Tool 24 . 16S rRNAencoding genes were also extracted from these nodes using Metaxa 2 25 and annotated using ACT 26 .
All assemblies were pairwise-aligned using Megablast 27 (E-value ≤1E -10 ) and all sequences sharing at least 98 % similarity over the shortest sequence were considered as originating from the same genome. Congruent data between these diverse methodologies allowed to cluster together sequences from the different genomes constituting each metagenome (Metagenome Assembled Genomes, MAGs). In order to improve bacterial genome assemblies by simplifying assembly paths (i.e. to obtain longer nodes), nodes belonging to a same MAG detected in several metagenomes were aligned using CSAR 28 . Resulting scaffolds were then aligned to assembly paths which were then curated using Bandage, following the software recommendations 29 . MAG assembly completeness was assessed using CheckM 30 . Paired-reads of each metagenome were mapped on each MAG using bowtie2 31 , and coverages calculated using samtools 32 . GC statistics were calculated using custom python scripts. Open Reading Frames (ORF) were predicted using Prodigal 33 . Functional annotations including KEGG were obtained using eggnog-mapper with default parameters 34 . Genome maps were generated using Circos 35 .

A. gracile comparative genomics
Genes encoding rRNA and tRNA were detected using Barrnap 36 and Aragorn 37 , respectively. ITS sequences were manually extracted using 16S and 23S rRNA-encoding genes coordinates.
Sequences from all A. gracile genomes were pairwise aligned using Blastn 27 . Clusters of Orthologous genes (COG) were defined using OrthoFinder (E-value≤1E -05 , identity≥70%) 38 . OG frequency was calculated as the proportion of A. gracile genome possessing at least one member of a COG. Synteny index for each genome position was then calculated based on COG triplets using custom perl scripts.
Relationship between OG frequency and synteny index was studied using a Pearson's Chi-square test (H0: independence hypothesis, p-value<0.01). A. gracile core genome was defined as constituted by all COGs with an OG frequency of 100%, all other COGs and singletons being considered as belonging to flexible genes set. Functional comparison between these both fractions was performed using Student tests (H0: no difference, p-value<0.01).

Cyanobacteria gene content comparisons
A database was constructed with the four A. gracile MAGs and 268 unique complete genomes of cyanobacteria downloaded from NCBI in addition to information about sampling date and location.
ANI was calculated by Mummer between each pair of genomes belonging to a same species using pyani (https://github.com/widdowquinn/pyani). For all members of pairs displaying an ANI greater than or equal to the minimal ANI between our four A. gracile strains genomes (i.e. 99.08%), clusters of orthologous genes were retrieved as described above. The gene content dissimilarity in all couples were assessed by the Bray-Curtis distance calculated from the matrix of COG presence/absence in each genome using R vegan package 39 . For clarity purposes, only species gathering at least 3 couples were displayed on the graph.
Biosynthetic gene clusters were retrieved from genomes using AntiSmash v5.1.2 tool 40 , except for saxitoxin gene clusters that have been additionally searched and confirmed after specific Blast search with sxtA using the MicroScope platform 41 .

Bacteriophage MAG characterization and analysis
A circular MAG assembled from PMC627 metagenome was aligned against nr using Megablast.
Given the high similarity with its single significant hit (namely Cyanophage vB_AphaS-CL131, see Results), this MAG was then considered as a cyanophage genome and named Ag627 phage.
Alignment between both sequences was visualized using Kablammo 42 , and Average Nucleotide Identity determined using ANI calculator 43 . In order to assess its replication strategy, an analysis based on Mavrich et al. 44 was performed. A bacteriophage genomes dataset was constructed composed by Ag627 and vB_AphaS-CL131 genomes as well as all "microbial virus" genomes available in NCBI RefSeq (2,528). Jaccard distances between each pair of genomes, proportional to the nucleotide distance, was calculated using MASH (k=16, s=10,000) 45 . In order to reduce the time of the following procedures, only genomes displaying a Jaccard distance <=0.15 with at least one other genome were kept. ORFs were then predicted for the remaining 1,871 genomes, and groups of homologous genes clusterized using Orthofinder with default parameters 38 . Gene content dissimilarity between every genome pair was then assessed by the Bray-Curtis distance calculated from cluster of homologous genes presence/absence in all genome matrix 39 .

Metabolite content characterization by high-resolution mass spectrometry
Metabolite content was extracted from each strain cellular biomass using 0.75% methanol (100 L for 1 mg dry mass) and further analysed on high resolution mass spectrometer, as previously described 46 . Shortly, Ultra High Performance Liquid Chromatography (UHPLC) was performed on 2 μL of each of the metabolite extracts using a Polar Advances II 2.5-m pore C18 column (Thermo®) at a 300 μL.min -1 flow rate with a linear gradient of acetonitrile in 0.1% formic acid (5 to 90% in 21 min). The metabolite contents were analysed in triplicates using an electrospray ionization hybrid quadrupole time-of-flight (ESI-QqTOF) high resolution mass spectrometer (Maxis II ETD, Bruker) on positive autoMSMS mode with information dependent acquisition (IDA), on the 50-1500 m/z rang at 2 Hz or between 2-8 Hz speed, for MS and MS/MS respectively, according to relative intensity of parent ions, in consecutive cycle times of 2.5 s, with an active exclusion of previously analysed parents. The data were analysed with MetaboScape 4.0 software for internal recalibration (<0.5 ppm), molecular feature search and MGF export. Peak lists were generated from MS/MS spectra between 1 and 15 min, with a filtering noise threshold at 0.1% maximal intensity and combining various charge states and related isotopic forms. A molecular network then was created using the online workflow at Global Natural Products Social molecular networking (GNPS) (http://gnps.ucsd.edu) 47 as previously described 11 . The clustered spectra of the network were annotated by comparing monoisotopic mass to our in-house cyanobacteria metabolite databases according to MS and MS/MS fragmentation pattern matches. Molecular networks were visualized using Cytoscape 3.2.1.

Metagenome-assembled genomes acquisition
After suitable quality trimming and filtering of reads from both technologies, hybrid assemblies generated between 294 (PMC644) and 568 (PMC638) scaffolds, displaying coverage median values of from 18.15X (PMC638) to 26.18X (PMC644). All sequencing and assembly statistics are summarized in Table S1. Synergistic scaffold binning approach allowed to characterize between three (PMC644) and 11 (PMC649) MAGs in each culture sample. Then, scaffold pairwise alignments clustered these data into 25 unique MAGs (see Table 1). Four MAGs correspond to the four divergent A. gracile genomes, 20 MAGs correspond to heterotrophic bacteria genomes including six circularized bacterial genomes, and one corresponds to a complete viral genome.

A. gracile genome features
A. gracile genomes assembly graphs were highly entangled, with between 73 (PMC627) and 316 (PMC638) short and highly connected nodes (data not shown), indicating a substantial amount of repeated sequences. The synergistic approach used to improve assembly allowed to merge  (Table 1).
On 21 tRNAs, only 10 displayed the same copy number in all strains (see Table S2). Comparable differences were observed for the 16S-ITS-23S rRNA operon copy number between genomes, from three (PMC649) to six copies (PMC627).

A. gracile Clusters of Orthologous Genes
In total 19,640 Open Reading Frames (ORFs) were predicted onto these genomes, with a number by strain comprised between 4,836 (PMC638) and 4,959 (PMC627) (  (Figure 1a-b). 13.56% and 8.99% of the COGs were shared exclusively between 3 and 2 genomes, respectively. Small numbers of COGs (between 1 to 8 by strain) were only composed by strain specific in-paralogs, but genomes displayed a higher number of strain specific single copy genes (singletons), between 212 to 306, representing up to 6.17% of the PMC627 gene content for instance. Given COGs and singletons, the current A. gracile pangenome was composed in total of 5,925 coding genes.

A. gracile genome synteny
Considering gene triplet frames, the average synteny index was 80.55±1.31% for one genome.
The regions displaying a synteny index of 100% (conserved in all strains) represented 68.22±0.49% of the genome lengths (see Figure 1.c for the PMC627 example). Genomic regions with synteny values of 66 and 33% represented 15.37 ±2.73% and 6.61±0.68% respectively of the genome length, and regions without synteny 9.79±2.20% (Figure 1.a, c). Genomic regions with synteny index <100% were evenly distributed along the genomes, and may gather up to 40 consecutive genes in PMC627.
This distribution seems to share a pattern similar with the distribution along the genome of COGs frequency, i.e. the percentage of A. gracile genomes carrying them, what was confirmed by a Pearson's Chi-square test (H0 independence hypothesis rejected with p-value<0.001, Figure 1.a).

SSU ribosomal RNA encoding gene locus diversity
Within a genome, no dissimilarity was observed among the different 16S rRNA or 23S rRNA copies. 16S rRNA sequences, with a conserved size of 1,485 bp, were identical between PMC638 and  Remarkably, the repertoires of biosynthetic gene clusters (BGC) specifically involved in secondary metabolite production were unique to each A. gracile strain, but always broader than those detected in the genomes of their close relatives (Figure 2

Metabolomic profiles characterization
The high potential of secondary metabolite production was confirmed by direct metabolomic analysis of the biomass extracts by high resolution mass spectrometry (HRMS) and visualization with molecular networks (Figure 3.a,b), allowing for instance the detection of high amounts of saxitoxin and neosaxitoxin in both PMC627 and PMC638 strains (Figure 3.a). The global molecular network performed by GNPS (Figure 3.b) further highlights the production of a large set of metabolites by the four strains, including fatty acids, amino acids, nucleic acids, glycolipids, phospholipids, saccharides and carotenoids. It also reveals the production of various yet uncharacterized variants of puwainaphycins, in the three strains that specifically present puwainaphycin gene cluster. In total, in the 1400 detected analytes only 64 (4.57%) were synthesized in all cultures. The great majority of them (75.14%) were specific to a single strain, illustrating the large molecular diversity retrieved in the metabolome of the four strains beyond their genomic similarities and differences.   AntA/AntB antirepressor), as well as a CRISPR-associated protein (Cas_csx3) were detected by this approach, but no functions classically associated in related to lysogeny were found (e.g. integrases, excisionases). Complementary phylogenetic analysis showed that both phages constitute a robustly supported clade, closely related to other Siphoviridae infecting Cyanobacteria (see Figure S1). These results leaded us to henceforth consider the MAG as a genome of a siphophage infecting Aphanizomenon gracile called vB_AphaS_Ag627.

The evolutionary mode of phage vB_AphaS_Ag627
Comparison of nucleotide distance and gene content dissimilarity between available bacteriophage genomes resulted in a plot where couples of genomes were distributed following two modalities ( Figure 5.c). One group of genome pairs were dispersed along a straight line, meaning that gene content dissimilarity between members was proportional to their nucleotide distance. This distribution is typical for genomes having essentially diverged from one another by mutation events and was called Low Gene Content Flux evolutionary mode (LGCF) 44  A. gracile cyanosphere metagenome-based community structure  Figure 6.a), were only abundant in PMC649 metagenome. Alphaproteobacteria and Bacteroidetes taxa, which gathered 10 and 8 MAGs, respectively (Table 2) Despite similar community structures in terms of taxa relative abundance between PMC638 and PMC644, these ratios were much greater in the latter for Alphaproteobacteria (4.16) and Bacteroidetes

Functional comparative metagenomics
In total, genetic communities were constituted by 12,520 to 28 (Figure 6.c). These clusters clearly separated Alphaproteobacteria, Bacteroidetes and Cyanobacteria groups, indicating divergences of their functional gene contents. Interestingly, the four points corresponding to the Cyanobacteria were more tightly packed together than those corresponding to Alphaproteobacteria and Bacteroidetes, suggesting less variation within the Aphanizomenon strains cluster than within the different associated Alphaproteobacteria or Bacteroidetes. An analysis of KEGG modules completion was performed, focused on these three taxa for statistical and exhaustiveness reasons because they were all present in all metagenomes and gathered the great majority of KO annotated genes (Figure 6.d). In total, KO were distributed into 289 KEGG modules, among which 106 showed significant differences in module completion between the three taxa (ANOVA, p-value<0.01, data not shown), confirming divergence observed in PcoA at the KO level.
Among all 289 functional modules retrieved in the three taxa abundant in all samples, namely Cyanobacteria, Alphaproteobacteria and Bacteroidetes, Aphanizomenon showed the lowest number of complete or nearly complete modules (Z-score closed to 1, Figure 6.d). Cyanobacterial protein functions have been associated to 164 modules, in which only 12 displayed significantly greater completion in Cyanobacteria compared to other taxa (pairwise Student test with Bonferroni correction, p-value<0.01). Half of them were related to "Energy Metabolism" and involved in nitrogen fixation, photosystems and ATP synthesis (Figure 6.d, Table S3). The other modules were associated with "Lipid Metabolism" (lactosylceramide biosynthesis), "Cofactors and Vitamins" (ubiquinone and tocopherol/tocotorienol biosynthesis), "Terpenoids/Polyketides" (beta-carotene biosynthesis), "Other secondary metabolites biosynthesis" (staphyloferrin B synthesis) and "Gene set" functional categories (vancomycin resistance). Bacteroidetes KO were assigned to a larger repertoire than Cyanobacteria, with a number of 227 modules, but their completion levels were generally low, especially in "Energy metabolism", "Biosynthesis of other secondary metabolites", "Xenobiotics biodegradation", and "Gene set" functional categories (Figure 6.d). Only three modules showed a significant greater completion in Bacteroidetes, involved in citrate cycle ("Carbohydrates metabolism"), coenzyme M biosynthesis ("Energy metabolism") and tetrahydrofolate biosynthesis ("Metabolism of cofactor and vitamins", Figure 6.d, Table S3) and effectors, cagA pathogenicity island signature, vancomycin and multidrug resistances) ( Figure   6.d, Table S3).

DISCUSSION
Evaluation of micro-scale biodiversity within a single cyanobacterial population is a cornerstone for the establishment of a holistic theoretical framework of cyanoHAB bloom formation 9 . In this study, we investigated the micro-scale genomic heterogeneity within Aphanizomenon gracile by comparing strains isolated from a single drop of freshwater during a bloom event, and the taxonomic and functional diversity of associated cyanospheres. We present the first sequenced genomes of Aphanizomenon gracile, and we show that strains isolated from a single drop of water display unexpectedly high genetic diversity, despite a constrained genome size (5.33±0.06 Mb) and the absence of self-evident differentiation at the small subunit 16S rRNA gene sequence and ANI levels 48 .
A striking representation of the micro-scale A. gracile genomic heterogeneity is given by the low synteny scores, only slightly higher than those reported among 12 Microcystis aeruginosa strains sampled from distant geographical locations 49 (80.55±1.31% vs 76±4%, respectively). Remarkably, low synteny in an A. gracile is closely associated with the presence of genes specific to this strain throughout its genome, rather than intra-genomic rearrangements, suggesting a strong impact of HGT its growth could also sustain these associations within the phycosphere. Some others, as Saprospirales and Sphingomonadales, have been shown to grow on cyanobacterial exudates, including secondary metabolites 76 , or to rapidly increase in abundance after bloom collapse in lakes (Flavobacterales, Sphingobacterales) 72 . This suggests that such bacteria are recruited in the phycosphere by cyanobacteria supply of metabolites. If all but one of these bacteria are detectable in every of the four cyanospheres, suggesting host fidelity, their relative abundances fluctuate greatly from one to another.
These variations could be at least partially related to the different metabolic profiles displayed by A.
gracile strains. Indeed, public good exchanges, partly mediated by extracellular vesicles 77 , are thought to structure the association between cyanobacteria and their cyanospheres 12,14,15 . If the community structures differ among cyanospheres, they seem to be largely functionally redundant.
Functional module completeness analysis suggests a reduced metabolic capability range in Aphanizomenon, compared to Bacteroidetes and mostly Alphaproteobacteria, but this striking difference could simply reflect the bias inherent to functional databases which are based on wellknown model micro-organisms highly divergent from cyanobacteria. However, in addition to products from photosynthesis and nitrogen fixation, specifically performed by A. gracile, which are already thought to be shared with the cyanosphere 14 , many secondary metabolites, vitamins (B2, B3 and B6 or E) and cofactors, mostly synthesized by Cyanobacteria and Alphaproteobacteria could also constitute public goods structuring the community. In contrast, only one complete module responsible for the biosynthesis of tetrahydrofolate, a cofactor involved in nucleotide metabolism, is found in Bacteroidetes.

CONCLUSIONS
In conclusion, exploration of four strains of A. gracile co-existing during a bloom event revealed levels of inter-individual genomic variability higher than previously recorded among conspecific strains of cyanobacteria. Although limited by the number of studied metagenomes, this study suggests that an even larger size of the A. gracile pangenome may exist at a single bloom scale, and that a strong spatial heterogeneity of cyanobacterial genotypes and associated cyanospheres may also occur, questioning the cyanobacterial population dynamics and numerous bacterial interactions leading to such events. A consequence is that genome sequencing of many strains is mandatory to gain proper overview of the potential of a whole population. This genotypic diversification is likely mostly mediated by temperate phages, including vB_AphaS_Ag627, the one newly identified during this study. Beyond the diversity statement, further experimental setups should further explore biotic interactions occurring on eutrophic lakes to decipher the mechanisms that regulate cyanoHAB populations until bloom collapse.