Diversity of CPRs in freshwater lakes and their general genome characteristics
A total of 282 CPR MAGs (> 40% completeness, < 5% contamination) from 8 classes were assembled from 119 freshwater metagenomes collected from 17 lakes. Their estimated genome sizes ranged from ~ 0.5 to 2.5 Mbp (median: 1.02 Mbp, Fig. 1; assembly length ~ 0.2–1.5 Mbp, median ~ 0.63 Mbp, Supplementary Table S3). When compared to organisms with known lifestyles from RefSeq r81 (Fig. 1, Supplementary Table S7), CPRs have genome sizes and numbers of genes comparable to those of obligate intra- or extracellular symbionts/parasites. Nevertheless, their coding density (median: 89.47%; range: 76–95%) and GC content (median: 42.04%; range: 24–63%) in general resemble free-living organisms or facultative intra- or extracellular symbionts/parasites.
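The two summary statistics compared above, GC content and coding density, can be illustrated with a minimal sketch. The function names and the gene-coordinate convention (1-based, inclusive) are assumptions made for illustration and do not reproduce the pipeline used in this study.

```python
def gc_content(seq: str) -> float:
    """Percent G+C among unambiguous bases (A, C, G, T); ambiguous bases are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def coding_density(genome_length: int, genes: list[tuple[int, int]]) -> float:
    """Percent of the genome covered by predicted genes.

    genes: (start, end) pairs, 1-based and inclusive (assumed convention).
    Overlapping genes are merged so that no base is counted twice.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(genes):
        start = max(start, last_end + 1)  # skip bases already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / genome_length
```

Merging overlaps matters for compact genomes, where neighboring genes frequently overlap by a few bases and a naive sum of gene lengths would overestimate the density.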
A phylogeny was generated using 38 single copy genes (SCGs) for 1012 representative CPR genomes retrieved from GTDB r89 together with 171 dereplicated freshwater genomes obtained in this study (Fig. 2, Supplementary Table S4). In the phylogenetic tree, no clear grouping was observed based on genome size, isolation source or the trophic state of the lake. Metagenomic fragment recruitment was used to estimate the abundance of different CPR MAGs in 119 freshwater metagenomes. Coverage per Gbp values for each bin in its own metagenome varied between 0.02 and 14.36. The highest coverage was recovered for MAGs obtained in the hypolimnion of Lake Ikeda (~ 2–8.1) and the epilimnion of Lake Zurich (6.7–14.3) (Supplementary Table S6). Based on 16S rRNA gene data, Lake Ikeda and Jiricka pond had the highest percentage of CPRs (maximum values for these lakes were 10.89 and 21.05%, respectively), especially of sequences affiliated to the classes Paceibacteria, ABY1 and Microgenomatia (Supplementary Figure S3). Gracilibacteria and Saccharimonadia appear to be only minor components of the lake communities (~ 1% abundance) according to the 16S rRNA gene data. Nevertheless, these quantitative results might underestimate the true abundance of CPRs in our samples, as some free or unattached small cells could have passed through the 0.22 µm membrane filters.
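The text does not spell out how coverage per Gbp is computed. One plausible reading, shown here purely as an assumption, is the mean depth of recruited reads over the MAG divided by the size of the metagenome in Gbp. The function and parameter names are hypothetical.

```python
def coverage_per_gbp(recruited_bases: int, mag_length: int, metagenome_bases: int) -> float:
    """Mean MAG coverage normalized by metagenome size in Gbp.

    Assumed definition (not stated in the text):
        (recruited bases / MAG length) / (metagenome bases / 1e9)
    """
    mean_coverage = recruited_bases / mag_length
    return mean_coverage / (metagenome_bases / 1e9)


# Example: 5 Mbp of reads recruited to a 1 Mbp MAG from a 10 Gbp metagenome.
print(coverage_per_gbp(5_000_000, 1_000_000, 10_000_000_000))
```

Normalizing by sequencing depth in this way makes values comparable between metagenomes of different sizes.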
Fragment recruitment analyses suggested MAGs were generally specific to the lake of origin, although a few exceptions to this rule were observed. Almost identical genomes from the hypolimnion of Lake Thun and both epi- and hypolimnion of Lake Traunsee were recovered in the hypolimnion of Lake Maggiore (ANI value of 98.56–98.83%), lakes located at relatively short distances (~ 400 km between Lake Traunsee and Maggiore and ~ 100 km between Lake Thun and Maggiore), while MAGs assembled from Lake Baikal were also found in Lake Biwa (ANI value of 99.53%, ~ 3000 km distance between lakes) (Supplementary Table S6, S12).
Genome replication rates in freshwater CPRs, based on ori/ter values provided by GRiD, varied between 1 and 1.35 (Supplementary Table S13, Fig. 3A), indicating slow growth or even stagnation at the time of sampling. Doubling time estimates for CPRs, symbionts and free-living bacteria followed a bimodal distribution (Fig. 3B), with the main peak for CPRs and free-living microbes at around 4 h, whereas symbionts were predicted to replicate more slowly (median ~ 7.5 h).
Assessment of different lifestyles in the CPR group
Natural abundances and visualization of most CPR lineages have remained elusive until now. To amend this, we designed eight FISH probes targeting different CPR lineages from four classes: ABY1 (2 probes), Paceibacteria (3 probes), Gracilibacteria (2 probes) and Saccharimonadia (1 probe) (Fig. 2, Supplementary Table S11, Supplementary Figure S2). This approach enabled the visualization of distinct CPR groups and brought to light new evidence about their potential life strategies (Fig. 4; Supplementary Figures S4-S11). A double hybridization with probe EUB I-III [79, 80] was possible only for 2 probes (SacA-77 and Pgri-121), as all other CPR lineages have > 1 mismatch to this general bacterial probe. Both double hybridizations resulted in staining with both fluorochromes (Supplementary Figures S12-13), confirming that the targeted CPR lineages are indeed bacteria. Negative controls using a nonspecific probe (NON338) and the CARD reaction without probe resulted in low, unspecific background signals, but no obvious staining of cells (Supplementary Figures S14-15). The very low abundances of individual CPR lineages targeted by our probes (Supplementary Figure S3) did not allow a precise quantification; however, multiple images were recorded, and cells could be sized (between 20 and 71 cells per probe, Supplementary Table S11). CPRs were generally small in size (0.36–0.70 µm length, 0.30–0.59 µm width, Supplementary Table S11, Supplementary Figure S16), but in the same range as genome-streamlined free-living freshwater microbes like ‘Ca. Nanopelagicales’ (0.33–0.50 µm length, 0.24–0.30 µm width) or ‘Ca. Fonsibacter’ (0.38 µm length, 0.27 µm width). However, the observed cell sizes could partially be a result of the filter pore size (0.2 µm) used for this approach, as smaller cells might have passed through.
ABY1: While being similar in terms of metabolic potential (data not shown), the two targeted ABY1 families of the same order (SG8-24) showed slightly different lifestyle preferences (Fig. 4a, b; Supplementary Figures S4, S5). Members of candidate family UBA9934, targeted by probe ABY1b-1343 (Fig. 2, Supplementary Figure S2, S5), were found exclusively unattached to other cells, while members of family GWF2-40-263 (probe ABY1a-193, Supplementary Figure S4; targeting several MAGs from this study and one MAG by Anantharaman et al.) were either free-living or attached to so-called ‘lake snow’, aggregates of living or decomposing microorganisms held together by extracellular polymeric substances, and an important source of organic matter [84, 85]. Cell sizes for the 2 families were very similar, averaging 0.52 ± 0.12 µm in length and 0.46 ± 0.12 µm in width for GWF2-40-263 and 0.52 ± 0.14 µm by 0.45 ± 0.14 µm in the case of UBA9934 (Supplementary Table S11, Supplementary Figure S16).
Paceibacteria: Representatives of the candidate family UBA11359 (probe ZE-1429) were identified as very small cocci (average size 0.35 ± 0.08 µm by 0.30 ± 0.06 µm), consistently associated with other larger prokaryotes (Fig. 4l, m, Supplementary Figure S11). Members of the genus GWA1-54-10 (order UBA9983_A; family UBA2163; also known as ‘Ca. Adlerbacteria’) were visualized with two CARD-FISH probes (Adl1-132 and Adl2-134, Supplementary Table S11) and were also shown to be small (average cell sizes 0.67 ± 0.2 µm by 0.59 ± 0.18 µm for Adl1-132 and 0.37 ± 0.1 µm by 0.31 ± 0.09 µm for Adl2-134). All microbes targeted by probe Adl1-132 appear to be free-living (Fig. 4c, d, Supplementary Figure S6). Only one free-living representative was observed with probe Adl2-134 (Supplementary Figure S7A, B), while all other cells targeted by this probe were associated with hosts, with up to 10 small GWA1-54-10 cells surrounding a large prokaryotic cell (Fig. 4e-g).
Gracilibacteria: Gracilibacteria family LOWO2-01-FULL-3 (order UBA1369), targeted by probe Pgri-124 (average cell sizes 0.48 ± 0.13 µm by 0.43 ± 0.12 µm), was associated with various small prokaryotes and picocyanobacteria (Fig. 4j, Supplementary Figure S9) and therefore was not limited to a definite host. Another Gracilibacteria family (2-02-FULL-48-14) of the same order targeted by probe Pgri-99 (Supplementary Figure S8) had cell sizes of 0.49 ± 0.12 µm by 0.45 ± 0.12 µm. They were likewise associated with small prokaryotes and cyanobacteria (Fig. 4h) but were also found in close vicinity to ‘lake snow’ particles (Fig. 4i).
Saccharimonadia: Members of the family UBA10212, targeted by probe SacA-77, were observed as diplococci (Fig. 4k, Supplementary Figure S10). Although their genomes are highly reduced and they lack important metabolic pathways for survival (Supplementary Table S5), they were not always associated with other cells (Supplementary Figure S4). They were observed either as elongated (on average 0.70 ± 0.07 µm by 0.41 ± 0.05 µm) or approximately round (average sizes 0.49 ± 0.08 µm by 0.43 ± 0.08 µm) cells.
Metabolic capabilities in freshwater CPR groups
In terms of average gene composition among metabolic pathways, Gracilibacteria form a cluster with free-living bacteria, while all other CPR groups possess a much more depleted metabolic repertoire and are similar to known symbionts (Supplementary Figure S17). Gracilibacteria that were visualized by CARD-FISH (order UBA1369, formerly known as Peregrinibacteria) encode the core 3-carbon compound module of glycolysis, the genes for nucleotide sugar biosynthesis, the pentose phosphate pathway (PPP), parts of the Calvin cycle and the biosynthesis of phosphoribosyl diphosphate (PRPP), which is required to produce both purines and pyrimidines (Fig. 5). The capacity for beta-oxidation of fatty acids and the pyruvate carboxylase are missing in all Gracilibacteria genomes, along with the genes required for the synthesis of the cofactors involved in these functions (biotin and thiamine). The genes for the biosynthesis of NAD+ and THF, two cofactors involved in nucleotide synthesis, are present. The pathways necessary to produce riboflavin, FMN and FAD were present in both Gracilibacteria orders. The same was true for pantothenate, although the enzymes required for its conversion to acetyl-CoA were not encoded in Gracilibacteria but were present in some Microgenomatia and Paceibacteria MAGs. In Gracilibacteria, acetyl-CoA can be obtained from pyruvate through the activity of pyruvate:ferredoxin oxidoreductase. Even though NADH dehydrogenase and the F-type ATPase are encoded by Gracilibacteria, together with a putative proton-pumping rhodopsin (Supplementary Table S15), terminal oxidases were not identified.
MAGs belonging to ABY1, Gracilibacteria, Microgenomatia, Paceibacteria and Saccharimonadia usually encode the part of the reductive pentose phosphate cycle (Calvin cycle) responsible for converting glyceraldehyde 3-phosphate (G3P) to ribulose-5P (Supplementary Table S5, Supplementary Figure S18), which is necessary for nucleotide biosynthesis. As appears common for CPRs [3, 10, 86], the Embden-Meyerhof glycolysis pathway is also rarely complete in our freshwater MAGs, with only a few exceptions in Paceibacteria. In general, 6-phosphofructokinase and glucokinase/hexokinase are missing, with only the core module involving 3-carbon compounds being completely encoded in all groups. Glycolysis can still be achieved through a metabolic loop involving the pentose phosphate pathway that is encoded in all classes (Supplementary Table S5). All genes encoding the enzymes involved in gluconeogenesis were present in freshwater Paceibacteria, but not in the same MAGs, and therefore it is difficult to conclude whether the pathway is complete. On the other hand, phosphoenolpyruvate carboxykinase, and sometimes fructose-1,6-bisphosphatase, showed patchy distributions or were missing completely in the other classes, making gluconeogenesis impossible. As previously reported, enzymes involved in the Krebs cycle are not encoded in freshwater CPRs, indicating a fermentative lifestyle [25, 87].
As previously described [10, 12], the capacity for fermentation is widespread in CPRs, highlighted by the prevalence of lactate dehydrogenase in ~ 50% of the freshwater MAGs assembled in this study, with the vast majority of ABY1 and Gracilibacteria possessing this gene (Fig. 6). Less common (< 30% of genomes) are the Zn-dependent and short chain alcohol dehydrogenases (ADH) that catalyze the interconversion between acetaldehyde and ethanol. Pyruvate decarboxylase, the enzyme converting pyruvate directly into acetaldehyde, was not identified in our freshwater MAGs; therefore, alcoholic fermentation could potentially occur via an additional step. First, pyruvate would be converted into acetyl-CoA by pyruvate:ferredoxin oxidoreductase, then acetyl-CoA would be further processed to acetaldehyde by acetaldehyde dehydrogenase (ALDH). The last step involves the conversion of acetaldehyde into ethanol by alcohol dehydrogenase (ADH). Both lactic and alcoholic fermentations are coupled with the oxidation of NAD(P)H to NAD(P)+, providing a steady supply of NAD+ for powering glycolysis and the slow generation of ATP in the absence of an ETC. Although acetate kinase is common in Gracilibacteria, and to some extent in Paceibacteria and Saccharimonadia, how acetyl-phosphate is obtained from acetyl-CoA in the last two classes is not clear, as the gene for phosphate acetyltransferase (pta) was observed only in Gracilibacteria. Nevertheless, if acetyl-P becomes available, these groups seem to be able to generate acetate by coupling this reaction with substrate level phosphorylation (Fig. 6).
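The route logic described above can be written as a small decision function over the set of genes detected in a MAG. This is only an illustrative sketch: the gene labels (`ldh`, `adh`, `pdc`, `pfor`, `aldh`, `pta`, `ack`) are hypothetical shorthand chosen here, not the annotation identifiers used in this study.

```python
def fermentation_routes(genes: set[str]) -> list[str]:
    """List the fermentation routes a genome could encode, given detected genes.

    Labels are illustrative shorthand: ldh (lactate dehydrogenase),
    adh (alcohol dehydrogenase), pdc (pyruvate decarboxylase),
    pfor (pyruvate:ferredoxin oxidoreductase), aldh (acetaldehyde
    dehydrogenase), pta (phosphate acetyltransferase), ack (acetate kinase).
    """
    routes = []
    if "ldh" in genes:
        routes.append("lactate")
    if "adh" in genes:
        if "pdc" in genes:
            # Direct route: pyruvate -> acetaldehyde -> ethanol
            routes.append("ethanol (direct, via pyruvate decarboxylase)")
        elif {"pfor", "aldh"} <= genes:
            # Indirect route: pyruvate -> acetyl-CoA -> acetaldehyde -> ethanol
            routes.append("ethanol (via acetyl-CoA)")
    if {"pfor", "pta", "ack"} <= genes:
        # pyruvate -> acetyl-CoA -> acetyl-P -> acetate (+ ATP)
        routes.append("acetate")
    return routes


print(fermentation_routes({"ldh", "pfor", "aldh", "adh"}))
```

A genome carrying acetate kinase but no phosphate acetyltransferase returns no acetate route here, which mirrors the open question raised for Paceibacteria and Saccharimonadia.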
All freshwater CPRs were found to encode carbohydrate-active enzymes (CAZy) with an average of 15 genes per genome (Supplementary Table S16), which is lower than previously reported in a thermokarst lake. By far the most frequently encountered enzymes belonged to the GT4 (involved in rhamnose degradation, found in > 95% of genomes) and GT2 (polysaccharide conversion, present in > 92.5% of genomes) families, representing ~ 35% and ~ 23% of the total number of identified CAZy, respectively. Enzymes for the degradation of chitin, cellulose and mannose were also identified, but in low abundances, ranging from 1.1 to 2.8% of total CAZy (Supplementary Table S16).
Though carbohydrate anabolism is limited, Microgenomatia and ABY1 (Supplementary Figure S18), and as mentioned before Gracilibacteria, can perform nucleotide sugar biosynthesis. The production of dTDP-L-rhamnose, a cell envelope component, is conserved in the classes ABY1, Kazan-3B-28, Microgenomatia, Gracilibacteria, Paceibacteria and Saccharimonadia, while the biosynthesis of ADP-L-glycero-D-manno-heptose is encoded only in Paceibacteria (Fig. 5, Supplementary Figure S18, Supplementary Table S5). In the case of our Gracilibacteria freshwater MAGs, C5 isoprenoid biosynthesis is usually performed through a mevalonate pathway typical for eukaryotes/archaea [3, 88], with only one exception, the UBA1369 order, which uses the methylerythritol phosphate pathway common in bacteria. Other classes seem able to perform only C10-C20 isoprenoid biosynthesis through a typical bacterial pathway (Fig. 5, Supplementary Table S5, Supplementary Figure S18). No ability whatsoever for the synthesis or beta-oxidation of fatty acids was detected in any group; thus, the way in which CPRs produce their cellular membranes remains enigmatic.
It was previously hypothesized that CPRs might enable phage infection as a somewhat risky source of nucleotides. In an analysis of over 1300 high-quality CPR genomes, CRISPR-Cas systems were detected in 1.68–6.36% (mean = 3.89%) of the MAGs in classes with > 50 genome representatives (for a detailed report about phage defense mechanisms see Additional File 1 and Supplementary Table S14), which places them in the low range for bacteria. An alternative source of nucleotides is the uptake of free DNA from the environment. We identified competence-related DNA transformation transporters (ComEA/ComEC) in most of our MAGs, providing a stable mechanism for DNA uptake [7, 90]. Moreover, inosine monophosphate (IMP) can be synthesized from phosphoribosyl diphosphate by Gracilibacteria, Microgenomatia and Paceibacteria. Further processing of IMP to ATP occurs in ABY1, Gracilibacteria, Microgenomatia and Paceibacteria, but the synthesis of GTP is not completely encoded in Microgenomatia, which lack guanylate kinase. Microgenomatia MAGs seem to possess all genes for uridine monophosphate (UMP) biosynthesis, unlike the other CPR classes. In contrast, the ability to turn UMP into CTP is widespread in most CPR groups, as is its conversion into TTP in ABY1, Gracilibacteria and Paceibacteria. Metabolic pathways for the synthesis of amino acids are usually absent, with some exceptions. For example, some MAGs in Microgenomatia and Paceibacteria encode the genes necessary to produce serine, while threonine can apparently be synthesized by ABY1 (Supplementary Figure S18), and possibly by Gracilibacteria and Paceibacteria. Genes for the biosynthesis of lysine, proline and tryptophan are present in Paceibacteria and Gracilibacteria, the latter also encoding pathways for other aromatic and branched-chain amino acids (Fig. 5). Histidine degradation to glutamate is encoded only in Paceibacteria.
Although freshwater CPRs of the ABY1 class encode restricted biosynthesis pathways, its members seem to possess numerous importers for tyrosine, branched-chain amino acids, multiple sugars, 3-phenyl propionate and polysaccharides, as well as transporters for ions, such as Fe2+, Mn2+, Mg2+ and Zn2+, and for nitrogen oxides. They are also able to export heavy metals, polysaccharides, and numerous antibiotics (Fig. 4, Supplementary Figure S18). Even though type II secretion systems (T2SS) were reported in CPRs [7, 90], we were able to detect only the presence of Type II/IV secretion ATPase GspE together with the inner membrane platform protein GspL. The remaining T2SS components were not detected, but general secretion (Sec) and twin-arginine translocation (Tat) systems were found in our freshwater CPRs.
Subunits of the electron transport chain in freshwater CPRs
Regarding the genes involved in generating an electron transport chain (ETC), subunits of the NADH dehydrogenase (complex I) are present in > 30% of the freshwater MAGs, being common in ABY1, Gracilibacteria, Paceibacteria and Saccharimonadia. Also, three MAGs affiliated to Saccharimonadia (sampled from the oxygenated hypolimnia of Lake Most and Řimov, 50 m and 30 m depth) and four belonging to Paceibacteria (recovered from the oxygenated hypolimnia of Lake Thun and Ikeda, 180 m and 100 m depth) encode all subunits of cytochrome o oxidase (HCO), indicating a putative capacity for aerobic respiration (complex IV). Additionally, five Paceibacteria MAGs recovered from the oxygenated hypolimnia of Lake Thun and Maggiore (180 m and 300 m depth, respectively) seem to have a functional cytochrome bd-type quinol oxidase.
The phylogeny of HCO subunit I, which included 1439 representative sequences, showed that CPRs probably obtained this gene horizontally from Proteobacteria (Supplementary Figure S19), as previously proposed. The closest group to CPRs in this tree belongs to Gammaproteobacteria, more specifically the orders Legionellales and Thiotrichales. Most of these organisms are facultative or obligate intracellular parasites, which might imply that the association with a host facilitated the HGT. Unexpectedly, the order Parachlamydiales of Verrucomicrobia, comprising mainly endosymbionts of free-living amoebae, appears to have obtained this subunit from Saccharimonadia. In our freshwater MAGs, HCO subunits are adjacent, forming an operon that was likely acquired horizontally in a single event. The phylogeny of the other HCO subunits follows the same evolutionary pattern as subunit I, with CPR sequences diverging from Proteobacteria, and Parachlamydiales sequences radiating from within Saccharimonadia (Supplementary Figure S19). The ML trees generated for cytochrome bd-type oxidases (Supplementary Figure S20) show that both subunits follow the same evolutionary pattern, in which the genes in CPRs appear to have been transferred from a proteobacterial source, forming a cluster together with cyanobacterial sequences.
Rhodopsin occurrence in CPRs
A total of 1326 CPR genomes (1032 GTDB representative genomes, 282 freshwater CPR MAGs assembled in this study, 12 MAGs analyzed by Jaffe et al.) were screened for the presence of rhodopsins. We were able to detect 115 rhodopsin sequences in 86 genomes, out of which 17 were predicted to be proton-pumping rhodopsins, while the rest had a reverse orientation (N-terminus on the inside, C-terminus on the outside of the membrane) and were therefore inferred to be heliorhodopsins (HeRs). Both proton-pumping rhodopsins and HeRs were identified predominantly in Saccharimonadia. HeRs had 80 occurrences in Saccharimonadia, 7 in Dojkabacteria, 3 in ABY1 and fewer in other classes, while we identified 15 proton-pumping rhodopsin sequences in Saccharimonadia, 1 in Gracilibacteria and 1 in Paceibacteria (Supplementary Table S15). While checking for the conserved lysine in transmembrane helix 7 that is required for retinal binding in HeRs, we observed that the most common motifs were SLVAK, SLIAK and SFVAK, the interchangeable amino acids all having hydrophobic side chains (Supplementary Table S15).
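A check for the three motifs named above can be sketched as a simple pattern search over a protein sequence. This is an illustration only: it scans the whole sequence rather than a predicted transmembrane helix 7, and the function name and return format are assumptions rather than the procedure used in this study.

```python
import re

# The three most common motifs reported in the text; each ends in the conserved lysine (K).
MOTIFS = ("SLVAK", "SLIAK", "SFVAK")
PATTERN = re.compile("|".join(MOTIFS))


def find_retinal_motifs(protein: str) -> list[tuple[int, str]]:
    """Return (0-based index of the lysine, matched motif) for every hit.

    Note: this does not verify that the hit lies within transmembrane helix 7;
    that would require a separate topology prediction.
    """
    return [(m.end() - 1, m.group()) for m in PATTERN.finditer(protein.upper())]


print(find_retinal_motifs("MAAASLVAKGG"))
```

Matching the exact motif list, instead of a broader pattern such as `S[LF][VI]AK`, avoids reporting combinations (for example SFIAK) that are not among the three motifs named in the text.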