We analyzed 106 metagenomes [45–48] from 30 stations distributed along the Amazon River basin, with an average coverage of 5.0 × 10⁹ base pairs per metagenome (Supplementary Table 1, Additional file 1). The stations, which ranged from the Solimões River and lakes along the Amazon River course upstream of the city of Manaus to the Amazon River plume in the Atlantic Ocean, covered ~2,106 km and were divided into 4 sections (Figure 1a; Supplementary Table 1, Additional file 1). These sections were: 1) Upstream section (upstream of Manaus city); 2) Downstream section (between Manaus and the start of the Amazon River estuary, including the influx of particle-rich white waters from the Solimões River as well as the influx of humic waters from the Negro River [49,50]); 3) Estuary section (the part of the river that meets the Atlantic Ocean); and 4) Plume section (the area where the ocean is influenced by Amazon River inputs).
Samples were taken as previously described [45–48]. Depending on the original study, particle-associated microbes were defined as those passing through the filter of 300 µm mesh-size and being retained on the filter of 2–5 µm mesh-size. Free-living microbes were defined as those passing through the filter of 2–5 µm mesh-size and being retained on the filter of 0.2 µm mesh-size. DNA was extracted from the filters as indicated in the original studies [45–48]. Metagenomes were obtained from libraries prepared with either Nextera or TruSeq kits. Different Illumina sequencing platforms were used: Genome Analyzer IIx, HiSeq 2500 or MiSeq. Additional information is provided in Supplementary Table 1, Additional file 1.
TeOM degradation machinery
To investigate TeOM degradation, we grouped samples by river section and assessed their gene content. Genes were then searched against reference sequences and protein families involved in TeOM degradation (see Supplementary Table 4, Additional file 1). In particular, lignin degradation starts with extracellular oxidation of the polymer, followed by internalization and metabolism of the resulting monomers or dimers by bacteria. Protein families related to lignin oxidation (PF05870, PF07250, PF11895, PF04261 and PF02578) were searched among PFAM-annotated genes. The genes related to the metabolism of lignin-derived aromatic compounds were annotated with Diamond (Blastp search mode; v.0.9.22) [72], with query coverage ≥ 50%, protein identity ≥ 40% and e-value ≤ 1e-5 as recommended by Kamimura et al. [36], using their dataset as reference.
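The three thresholds above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the column layout of the hit table (query, subject, percent identity, query start, query end, query length, e-value), the gene and reference names, and the computation of query coverage from the alignment coordinates are all assumptions made for illustration.

```python
import csv
import io

# Thresholds stated in the text.
MIN_QUERY_COVERAGE = 50.0  # percent of the query covered by the alignment
MIN_IDENTITY = 40.0        # percent protein identity
MAX_EVALUE = 1e-5

# Assumed (hypothetical) column order of the tab-separated hit table.
COLUMNS = ["query", "subject", "pident", "qstart", "qend", "qlen", "evalue"]


def query_coverage(qstart, qend, qlen):
    """Percent of the query sequence spanned by the alignment."""
    return 100.0 * (abs(qend - qstart) + 1) / qlen


def filter_hits(handle):
    """Yield (query, subject) pairs for hits passing all three thresholds."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        coverage = query_coverage(
            int(row["qstart"]), int(row["qend"]), int(row["qlen"])
        )
        if (
            coverage >= MIN_QUERY_COVERAGE
            and float(row["pident"]) >= MIN_IDENTITY
            and float(row["evalue"]) <= MAX_EVALUE
        ):
            yield row["query"], row["subject"]


# Made-up hits, one per failure mode.
example = io.StringIO(
    "gene_1\tref_A\t62.5\t1\t180\t200\t1e-40\n"  # passes all thresholds
    "gene_2\tref_B\t35.0\t1\t190\t200\t1e-30\n"  # identity below 40%
    "gene_3\tref_C\t80.0\t1\t60\t200\t1e-12\n"   # coverage of 30%
    "gene_4\tref_D\t45.0\t10\t150\t200\t1e-3\n"  # e-value above 1e-5
)
kept = list(filter_hits(example))
```

Only the first made-up hit satisfies all three criteria; each of the others fails exactly one.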
Cellulose and hemicellulose degradation involve glycosyl hydrolases (GH). The most common cellulolytic protein families (GH1, GH3, GH5, GH6, GH8, GH9, GH12, GH45, GH48, GH51 and GH74) [78] and cellulose-binding motifs (CBM1, CBM2, CBM3, CBM6, CBM8, CBM30 and CBM44) [78,79] were searched in PFAM annotations. In addition, the most common hemicellulolytic families (GH2, GH10, GH11, GH16, GH26, GH30, GH31, GH39, GH42, GH43 and GH53) [79] were searched in the PFAM database. Lytic polysaccharide monooxygenases (LPMO) [79] were also identified using PFAM to investigate the simultaneous deconstruction of cellulose and hemicellulose.
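As an illustration of how the family lists above could be applied, the following Python sketch assigns annotated genes to the cellulolytic, cellulose-binding and hemicellulolytic categories. It assumes that each gene carries a list of family labels such as "GH5" or "CBM2"; the gene names and the annotation format are hypothetical, and LPMO families are omitted because they are not enumerated in the text.

```python
from collections import Counter

# Family lists taken from the text.
CATEGORIES = {
    "cellulolytic": {
        "GH1", "GH3", "GH5", "GH6", "GH8", "GH9",
        "GH12", "GH45", "GH48", "GH51", "GH74",
    },
    "cellulose_binding": {
        "CBM1", "CBM2", "CBM3", "CBM6", "CBM8", "CBM30", "CBM44",
    },
    "hemicellulolytic": {
        "GH2", "GH10", "GH11", "GH16", "GH26", "GH30",
        "GH31", "GH39", "GH42", "GH43", "GH53",
    },
}


def classify(families):
    """Return the sorted categories matched by a gene's family labels."""
    labels = set(families)
    return sorted(name for name, members in CATEGORIES.items() if labels & members)


# Hypothetical annotations: gene identifier -> family labels.
annotations = {
    "gene_a": ["GH5", "CBM2"],
    "gene_b": ["GH43"],
    "gene_c": ["PF00005"],  # unrelated family, matches no category
}

# Number of genes assigned to each category.
counts = Counter(
    category
    for families in annotations.values()
    for category in classify(families)
)
```

A gene with several matching domains, such as the hypothetical gene_a, is counted once in each category it matches.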
During the degradation of refractory and labile material by exoenzymes, microbes produce a complex mix of particulate and dissolved organic carbon. The use of this mix is mediated by a vast diversity of transporter systems [37]. The typical transporters associated with lignin degradation (MFS transporter, AAHS family, ABC transporters, MHS family, ITS superfamily and TRAP transporter) were searched with Diamond (v.0.9.22) [72] against a reference dataset [36], using query coverage ≥ 50%, protein identity ≥ 40% and e-value ≤ 1e-5.
Similarly to the fate of hemi-/cellulose degradation byproducts, lignin degradation ends with the production of 4-carboxy-4-hydroxy-2-oxoadipate, which is converted into pyruvate or oxaloacetate, both substrates of the tricarboxylic acid cycle (TCA) [36]. Recently, several substrate binding proteins (TctC) belonging to the tripartite tricarboxylate transporter (TTT) system were associated with the transport of TeOM degradation byproducts, like adipate [39] and terephthalate [40]. To investigate the metabolism of these compounds, and the possible link between the TTT system and lignin/cellulose degradation, the protein families TctA (PF01970), TctB (PF07331) and TctC (PF03401) were searched in PFAM.
The genes found using the above-mentioned strategy were submitted to PSORT v.3.0 [80] to predict the subcellular localization of their proteins (cytoplasm, secreted to the outside, inner membrane, periplasm, or outer membrane). We carried out predictions for the three possible taxa (Gram negative, Gram positive and Archaea), and the prediction with the best score was used to determine the subcellular localization. Genes assigned to an “unknown” location, as well as those with an implausible assignment (for example, genes known to work in the extracellular space that were assigned to the cytoplasmic membrane), were eliminated.
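The selection rule described above (keep the best-scoring of the three predictions, then discard unknown or implausible locations) can be sketched in Python as follows. The data structure, the localization labels and the scores are hypothetical and do not reproduce the actual output format of the prediction tool. Ties between modes, which the text does not address, are resolved here in favor of the first mode listed.

```python
def best_localization(predictions):
    """Pick the prediction with the highest score.

    predictions maps a prediction mode (e.g. "gram_negative") to a
    (localization, score) tuple. Returns (mode, localization, score).
    """
    mode, (localization, score) = max(
        predictions.items(), key=lambda item: item[1][1]
    )
    return mode, localization, score


def keep_gene(localization, expected=None):
    """Discard unknown locations and, if given, unexpected ones.

    expected is an optional set of localizations considered plausible
    for the gene's function (an assumption of this sketch).
    """
    if localization.lower() == "unknown":
        return False
    if expected is not None and localization not in expected:
        return False
    return True


# Hypothetical predictions for one gene expected to act outside the cell.
predictions = {
    "gram_negative": ("Extracellular", 9.6),
    "gram_positive": ("Cytoplasmic", 7.5),
    "archaea": ("Unknown", 2.0),
}

mode, localization, score = best_localization(predictions)
retained = keep_gene(localization, expected={"Extracellular"})
```

In this made-up case the Gram-negative prediction wins, and the gene is retained because its predicted location agrees with the expected one.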
The total number of TeOM degradation genes found per function (lignin oxidation, transport, hemi-/cellulose degradation and lignin-derived aromatic compounds metabolism) in each section of the river was normalized by the total gene count per metagenome. Subsequently, correlograms were produced using Pearson’s correlation coefficients with the R packages Corrplot [81] and RColorBrewer [82]. The linear geographic distance of each metagenome to the Amazon River source (i.e. Mantaro River, Peru, 10° 43′ 55″ S / 76° 38′ 52″ W) was also used in this analysis to infer changes in gene counts along the Amazon River course. Geographic distance was calculated with the R package Fields [83].
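The normalization, distance and correlation steps can be illustrated with a short Python sketch. The analysis itself was carried out in R, so this is only a stand-in: the haversine great-circle formula, the Earth radius, and the sampling coordinates and gene counts below are assumptions made for illustration. Only the coordinates of the river source come from the text.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Amazon River source as given in the text (Mantaro River, Peru).
SOURCE = (dms_to_decimal(10, 43, 55, "S"), dms_to_decimal(76, 38, 52, "W"))

# Hypothetical metagenomes:
# (latitude, longitude, TeOM degradation gene count, total gene count)
metagenomes = [
    (-3.30, -60.60, 1200, 400000),
    (-2.50, -54.90, 900, 450000),
    (-0.10, -50.50, 500, 500000),
    (4.50, -51.00, 200, 480000),
]

# Distance of each metagenome to the river source.
distances = [
    haversine_km(SOURCE[0], SOURCE[1], lat, lon)
    for lat, lon, _, _ in metagenomes
]
# Gene counts normalized by the total gene count of each metagenome.
normalized = [hits / total for _, _, hits, total in metagenomes]

r = pearson(distances, normalized)
```

With these made-up values the normalized count falls as the distance to the source grows, so the coefficient is negative. In a correlogram, the same coefficient would be computed for every pair of variables.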