Study site and sample collection
This study was part of an ongoing consortium project that focused on a primary-scale survey of soil chemistry and microbiology across a range of regional and climatic zones in sub-Saharan Africa38.
In Kenya, a microbiome survey of the soils across selected forest ecosystems was carried out based on a census for forest regions (http://kws.go.ke/content/overview-0). Data capture at each sampling site included GPS location, elevation, vegetation at the time of sample collection, slope, general soil description and general site description. ArcGIS 10.8.1 (Environmental Systems Research Institute software application, 2020) was used to visualize and display the sample sites (Fig. 7). The distribution and characteristics of the selected forests used in this study are summarized in Supplementary Table 1.
Sampling was done by recovering 4 x 200 g topsoil samples (0–5 cm depth) at approximately 50 m spacing at each site. Each working sample was obtained by scooping a composite of 4 x 50 g pseudo-replicate samples, recovered from the corners of a one square meter virtual quadrat. Each sample was collected in a separate labelled Whirl Pak bags and stored at 4oC prior to shipment to University of Pretoria (South Africa) for nucleic acid extraction and soil physicochemical analysis.
These samples were later grouped into regions depending on geographical location on the Kenyan map as follows: Aberdare (Sample K23, K33, K34, K63 and K77); Mt. Kenya (K35, K36, K37, K38, K39, K40, K42 and K66); Nairobi (K15, K16, K29, K70 and K71); Taita Taveta (K5, K6, K7, K8, K9 and K10) and Western region (K18, K21, K24, K25, K26, K27 and K28).
Soil Physicochemical Characteristics
Soil physicochemical characteristics (Supplementary Table 1) were determined using protocols outlined by AgriLASA (2004). Soil pH was measured using the slurry method at a 1:2.5 soil/water ratio, and the pH of the supernatant was recorded with a calibrated bench top pH meter (Crison Basic, + 20, Crison, Barcelona, Spain). The concentrations of soluble and exchangeable of sodium (Na), potassium (K), carbon (C), magnesium (Mg), and phosphorus (P) were determined using Mehlich 3 test 84. The extractable ion concentration was quantified using ICP-OES (Inductively Coupled Plasma Optical Emission Spectrometry, Spectro Genesis, SPECTRO Analytical Instruments GmbH & Co. KG, Germany). Soil particle size distribution (sand/silt/clay percent) was measured using the Bouyoucos method 85. Total nitrogen (TN) and soil organic carbon (TOC) were measured using the catalyzed high temperature combustion method (Dumas method) 86. The Enhanced Vegetation Index-2 (EVI2) was obtained from the NASA Land Processes Distributed Active Archive Center’s (LP DAAC) VIIRS Vegetation Indices dataset 87 at a 500-meter resolution.
Prokaryotic Dna Extraction And 16srrna Gene Sequencing
Total DNA was extracted from soil samples using the DNeasy PowerSoil Kit (QIAGEN, USA) following the manufacturer's instructions with the following modifications; the elution buffer C6 was pre-heated to 55°C for 10 minutes before the final elution step, and the DNA was eluted using 70 µl of the elution buffer. After extraction, DNA concentration and purity were checked using the Nanodrop 2000 (ThermoFisher, USA) and agarose gel electrophoresis. The DNA samples were sent to MRDNA laboratories (www.mrdnalab.com, Shallowater, TX, USA) for sequencing of the V4/V5 16S rRNA gene, using the 515F (5'-GTGYCAGCMGCCGCGGTAA-3') and 909 R (5'-CCCCGYCAATTCMTTTRAGT-3') primers, according to88 38. Before library preparation, the regions of interest were amplified using the HotStarTaq Plus Master Mix Kit (Qiagen, USA) and subsequently purified using calibrated Ampure XP beads (Beckman Coulter Life Sciences, USA). Sequencing was performed at MR DNA (www.mrdnalab.com, Shallowater, TX, USA) on MiSeq instrument following the manufacturer’s guidelines.
Sequence Analysis And Taxonomic Classification
The generated raw amplicon sequence reads were filtered, trimmed, and clustered into unique amplicon sequence variants (ASVs) using the QIIME2 pipeline 89. Briefly, raw sequences were demultiplexed, quality checked and a feature table constructed using Divisive Amplicon Denoising Algorithm 2 (DADA2) pipeline 90 inbuilt within QIIME2 91. The raw sequences were denoised and chimeras removed. Sequences which were < 200 base pairs after phred20- base quality trimming, with ambiguous base calls, and those with homopolymer runs exceeding 6bp, were removed. The forward and reverse reads were truncated at 324 base pairs. This was followed by calculation of denoising statistics, picking of representative sequences and creation of ASVs feature table. Sequencing processing resulted in a total of 1,944,316 high quality sequence reads, which were clustered into 41,901 ASVs at 3% genetic distance.
Representative sequences were aligned using MAFFT 92 and highly variable regions were masked to reduce the noise in phylogenetic analysis 93. Phylogenetic trees were created and rooted at midpoint 94 on QIIME2. Taxonomic classification of ASVs was done using QIIME feature-classifier 91 against the untrained SILVA 138.1 (release 2022.2) 95,96. Demultiplexed high-quality sequence reads were deposited in the National Centre for Biotechnology Information (NCBI) Sequence Read Archive (SRA), as Bio Project ID: PRJNA851255 and study accession numbers available for download at http://www.ncbi.nlm.nih.gov/bioproject/851255.
Data processing of amplicon datasets from other countries.
Sequence datasets from selected forests around the globe were downloaded from publicly available databases (accession numbers at Supplementary Table 3) and processed using the QIIME2 pipeline as described above. Raw reads from the downloaded datasets spanned the 16S rRNA gene hypervariable regions v3-v4, v4, and v4-v5, depending on the study. To keep the sample sizes between countries comparable, a subset of between 28 to 30 samples was chosen for each dataset. To accommodate the variable quality scores of the different datasets, quality threshold was set to 20 and all reads were truncated at 220 bps. After DADA2 processing, the resulting representative sequence file and ASV table were merged with the Kenyan dataset. Read counts for the combined dataset ranged from 10877 to 346157 reads (Supplementary Figure S1). The merged representative sequence file was taxonomically annotated using the un-trained SILVA database 138.1 (release 2022.2), 95,96.
Statistical analysis
ASVs from QIIME2 were modified for use with the package phyloseq (version 1.36.0) in RStudio 97,98. The taxonomy table was merged with the feature table, and the relative abundance and bar plots were plotted using the ggplot package (version 3.3.5) 97,98. The normality of the dataset was first tested with the Shapiro-Wilk test 99. The Kruskal-Wallis Rank Sum test 100 was subsequently used to calculate the significance of mean differences in soil variables between forest soil samples (adj. p. value < 0.01). Tukey post hoc analysis test 101 were used to compare significant differences between regions where soil environmental variables were normally distributed (adj. p. value < 0.01).
Significant differences in soil physicochemical characteristics were calculated using the stats package version 3.6.2 102 in RStudio version 4.0.3 103. The distribution of soil physicochemical variables across different forest sites was calculated on log-standardized data using the “decostand” function 104 from vegan package (version 2.5.7) 105, which performs principal component analysis of the data (PCA) 104. The resulting distance matrix between samples was plotted in a PCA graph, with the projected direction and magnitude of the distribution for each variable plotted in a separate loading plot. The hmisc (version 4.5) package 102 was subsequently used to calculate strong significant Pearson correlations 100 between variables (adj. p-value < 0.01), which were plotted in a bubble graph using the corrplot (version 0.9) package 102.Biodiversity metrics (alpha diversity) and community structure dissimilarity (beta diversity) were calculated using the vegan (version 2.5.7) 105 and phyloseq (version 1.16.2) 98 packages in RStudio. Observed richness, Inverse Simson 106 and the Shannon indexes 107 were used as metrics for alpha-diversity 106. The prokaryotic ASV table was split into Archaea and Bacteria using the “subset_taxa” function in phyloseq before calculating the diversity indexes. Differences in alpha-diversity between designated regions were assessed as described for the soil physicochemical variables. Beta-diversity index of each soil sample was calculated from the log-transformed ASV tables using the “vegdist” function in vegan, based on Bray-Cutis distance estimation method 108. Ordination of the beta-diversity scores was plotted on a principal component analysis plot (PCoA) 109, and the significance of beta-diversity dissimilarity between forest regions was calculated using Permutational Multivariate Analyses of Variance (PERMANOVA) 110 with 999 permutations. Comparison of beta-diversity distribution between the samples of different countries datasets was also performed using the methodology described above.
The environmental drivers of prokaryotic community structure were estimated using Redundancy analysis (RDA) 111. The soil physicochemical dataset was z-score standardized and tested for multicollinearity using the “vif” function from the car (version 3.0.11) package 112. The best models for explanatory variables were calculated using forward step-wise regression model selection method with the ordistep() function in the vegan package, with 1000 permutations, and significant variables with vif values above 10 were removed The significance of the best fitted models and each predictor variable in the model were calculated using the ANOVA permutation test, 113 with 1000 permutations. The relative taxonomic abundances of prokaryotic taxa were compared between regions using Linear Discriminant Analysis (LDA) effect size (LEfSe) algorithm 114 for high-dimensional biomarker discovery and explanation of differentially abundant organisms. This analysis was implemented using the package Microbiome Marker in RStudio 115. Differences were analyzed using Kruskal-Wallis sum-rank test 116 to detect significant differentially abundant taxa at genus level (adj. p. value < 0.01). The biological consistency was investigated using a set of pairwise tests among genera using the Wilcoxon rank-sum test 117 118, with an LDA threshold of 2. The same LDA method was used to detect differently abundant taxa across the country datasets.